DOCUMENT RESUME

ED 107 161    FL 006 940



AUTHOR          Jones, Randall L., Ed.; Spolsky, Bernard, Ed.
TITLE           Testing Language Proficiency.
INSTITUTION     Center for Applied Linguistics, Washington, D.C.
PUB DATE        75
NOTE            152p.
AVAILABLE FROM  Center for Applied Linguistics, 1611 North Kent
                Street, Arlington, Virginia 22209 ($7.95)

EDRS PRICE      MF-$0.76 HC-$8.24 PLUS POSTAGE
DESCRIPTORS     *Conference Reports; Curriculum Guides; Language
                Ability; Language Fluency; *Language Proficiency;
                *Language Skills; *Language Tests; Linguistic
                Performance; Listening Comprehension; Listening
                Tests; Oral Communication; Reading Comprehension;
                Reading Tests; Test Construction; *Testing; Testing
                Problems; Test Validity



ABSTRACT 

This publication is a compilation of the papers
presented at the 1974 Washington Language Testing Symposium. The
volume also includes much of the discussion that followed each paper.
The participants were an international group of language testing
specialists from academic institutions, research centers, and
government agencies. The primary focus of the symposium was language
proficiency testing, especially as it relates to the use of foreign
languages on the job. The papers are organized under four headings:
(1) Testing Speaking Proficiency: "Testing Language Proficiency in
the United States Government," R. L. Jones; "Theoretical and
Technical Considerations in Oral Proficiency Testing," J. L. D.
Clark; "The Oral Interview Test," C. P. Wilds; (2) Testing Listening
Comprehension: "Testing Communicative Competence in Listening
Comprehension," P. J. M. Groot; "Reduced Redundancy Testing: A
Progress Report," H. L. Gradman and B. Spolsky; "Dictation: A Test of
Grammar Based Expectancies," J. Oller, Jr. and V. Streiff; (3)
Testing Reading Comprehension: "Contextual Testing," J. Bondaruk, J.
Child, and E. Tetrault; "Some Theoretical Problems and Practical
Solutions in Proficiency Test Validity," C. R. Petersen and F. A.
Cartier; "Two Tests of Speeded Reading," A. Davies; (4) Other
Considerations: "Problems of Syllabus, Curriculum, and Testing in
Connection with Modern Language Programmes for Adult Europe," G.
Nickel. The concluding statement, by B. Spolsky, and a list of
contributors to the conference are also provided. (Author/AH)



Testing
Language
Proficiency

U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.



Edited by Randall L. Jones
and
Bernard Spolsky

Center for Applied Linguistics







Copyright © 1975

by the Center for Applied Linguistics
1611 North Kent Street
Arlington, Virginia 22209

ISBN: 87281-040-2

Library of Congress Catalog Card Number: 75-13740
Printed in the United States of America 







Preface 

Randall L. Jones and Bernard Spolsky 



The 1974 Washington Language Testing Symposium was the natural 
result of cooperation between two recently established groups whose 
primary concern is language testing. The Testing Subcommittee of the 
United States Government Interagency Language Roundtable was 
organized in 1972. Its principal function is to coordinate research and 
development of language tests among the various U.S. Government 
language schools. The Commission on Language Tests and Testing was 
formed at the Third International Congress of Applied Linguistics in 
Copenhagen in August 1973 as part of the International Association 
of Applied Linguistics. Among the tasks assigned to the Commission 
was "to organize specialized meetings on tests and testing at a time 
other than the regular AILA Congress." In filling this task, it attempted 
to provide a continuation to a series of conferences on language testing
which had already taken place, including the 1967 ATESL Seminar
on Testing (Wigglesworth 1967), the 1967 Michigan conference (Upshur
and Fata 1968), and the 1968 conference at the University of Southern
California (Brière 1969). The first such meeting was organized in
conjunction with the 1973 TESOL Convention; some of the papers
presented there have just been published (Palmer and Spolsky 1975).
A second meeting was held in Hasselt, Belgium in September 1973.

The papers in this volume represent the third of these meetings.
The participants were language testing specialists from academic
institutions, research centers, and U.S. and other government agencies.
The primary focus of the symposium was language proficiency
testing, especially as it relates to the use of foreign languages on the
job. This volume includes not only the papers that were presented,
but also much of the discussion that followed each paper. It thus
provides a useful picture of the state of language proficiency testing,
and illustrates as well the possibilities which emerge when practitioners
and theorists meet to discuss their common problems.

Many people contributed to the success of the conference. Special
thanks are due to the members of the Testing Subcommittee of the
U.S. Government Interagency Language Roundtable who contributed
financial support (the Foreign Service Institute of the Department of
State; the Defense Language Institute of the Department of Defense;
the Office of Education of the Department of Health, Education and
Welfare; the Central Intelligence Agency; and the National Security
Agency), to Georgetown University for hosting the conference, to the
Center for Applied Linguistics for their financial support as well as
their willingness to publish the proceedings, and to all the participants,
many of whom came from great distances to be present. We are
most grateful to Allene Guss Grognet and Marcia E. Taylor of the
Center for Applied Linguistics for the great assistance they provided
in preparing this volume for publication.

REFERENCES 

Brière, Eugene. "Current Trends in Second Language Testing." TESOL Quarterly 3:4
(December 1969), 333-340.

Wigglesworth, David C. (ed.). Selected Conference Papers of the Association of
Teachers of English as a Second Language. Washington, D.C.: NAFSA, 1967.

Upshur, John A. and Julia Fata (eds.). Problems in Foreign Language Testing. Language
Learning, Special Issue No. 3 (1968).

Palmer, Leslie and Bernard Spolsky (eds.). Papers on Language Testing 1967-74.
Washington, D.C.: TESOL, 1975.






Table of Contents 



Preface iii
Randall L. Jones and Bernard Spolsky

TESTING SPEAKING PROFICIENCY

Testing Language Proficiency in the United States Government 1
Randall L. Jones

Theoretical and Technical Considerations in Oral Proficiency Testing 10
John L. D. Clark

The Oral Interview Test 29
Claudia P. Wilds

TESTING LISTENING COMPREHENSION

Testing Communicative Competence in Listening Comprehension 45
Peter J. M. Groot

Reduced Redundancy Testing: A Progress Report 59
Harry L. Gradman and Bernard Spolsky

Dictation: A Test of Grammar Based Expectancies 71
John W. Oller, Jr. and Virginia Streiff

TESTING READING COMPREHENSION

Contextual Testing 89
John Bondaruk, James Child and E. Tetrault

Some Theoretical Problems and Practical Solutions in Proficiency Test Validity 105
Calvin R. Petersen and Francis A. Cartier

Two Tests of Speeded Reading 119
Alan Davies

OTHER CONSIDERATIONS IN TESTING

Problems of Syllabus, Curriculum and Testing in Connection with Modern Language Programmes for Adult Europe 131
Gerhard Nickel

Concluding Statement 139
Bernard Spolsky

List of Contributors 145






Testing Language Proficiency in the United States 
Government 

Randall L. Jones



Of the thousands of students enrolled in foreign language courses in
the United States, only a relatively small percentage are associated
with Government language training programs. Yet this minor segment
of the language learning population is unusual and potentially significant
for the language teaching profession as a whole. The students in
U.S. Government language schools are exclusively adults who are
learning a language because it is important for a position they are
either occupying or are about to be placed in. Many of them have already
learned a second language and have used it in the country
where it is spoken. They are probably enrolled in full-time courses
which last for six to twelve months. And perhaps most important, the
majority of them will have occasion to use the language frequently
soon after the end of the training period. The conditions for language
learning are close to ideal, and certainly useful for doing research and
experimentation.

Positions in federal agencies for which knowledge of a second language
is required are referred to as "language-essential." Because the
degree of proficiency does not need to be the same for all positions, it
is necessary to define levels of proficiency and to state the minimum
level for any language-essential position. Such a system obviously
necessitates a testing program that can accurately assess the ability of
an individual to speak, understand, read or write a foreign language,
and that can assign a proficiency score to that person which will indicate
whether he is qualified to assume a specified language-essential
position. The outcome of such a test may well have a significant effect
on the career of the individual.

In 1968 an ad hoc interagency committee (with representatives from
the Foreign Service Institute [FSI], the Defense Language Institute
[DLI], the National Security Agency [NSA], the Central Intelligence
Agency [CIA], and the Civil Service Commission [CSC])
met to discuss the standardization of language scores for government
agencies. The committee proposed a system which would provide for
the recording of language proficiency in four skills: speaking, listening
comprehension, reading and writing. It was decided that degrees of
proficiency in each of these skills could be represented on an eleven
point scale, from 0 to 5, with pluses for levels 0 through 4. A set of
definitions was prepared for the four skills at each of the principal
levels (1-5). (The definition for speaking is essentially the same as had
already been in use at FSI prior to 1968.)
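
As an illustration only, the eleven-point scale described above can be enumerated in a few lines of Python. This is a minimal sketch, not part of the committee's system: the function names `proficiency_scale` and `meets_minimum` are hypothetical, and the comparison helper simply assumes that ratings are ordered 0, 0+, 1, 1+, and so on up to 5.

```python
def proficiency_scale(top=5):
    """Build the ratings 0, 0+, 1, 1+, ..., 4, 4+, 5 in ascending order."""
    ratings = []
    for level in range(top + 1):
        ratings.append(str(level))
        if level < top:  # pluses exist for levels 0 through 4 only
            ratings.append(f"{level}+")
    return ratings


def meets_minimum(rating, minimum):
    """Hypothetical helper: True if `rating` is at or above `minimum`."""
    scale = proficiency_scale()
    return scale.index(rating) >= scale.index(minimum)


if __name__ == "__main__":
    scale = proficiency_scale()
    print(len(scale), scale)
    # A language-essential position with an assumed minimum of level 3:
    print(meets_minimum("2+", "3"))
    print(meets_minimum("3+", "3"))
```

Six whole levels plus five plus ratings give the eleven points; a check such as `meets_minimum` is one way a minimum level for a language-essential position might be compared against a recorded score.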

The scale and definitions proposed by the ad hoc committee have
been adopted by the members of the Interagency Language Roundtable
of the United States Government for use in their respective
language training programs. However, a number of questions relating
to standards of testing, test development, test technique, test validation,
etc. still remain to be answered. The Roundtable's recently established
Subcommittee on Testing has been given the task of dealing
with these problems, many of which are certainly not peculiar to
government language programs and are, of course, not new to the
language teaching profession as a whole. We felt that it was therefore
appropriate to convene a meeting of both government and non-government
language testing specialists to discuss them. The members of the
panel possess broad and varied backgrounds in the field of language
testing. They represent government-affiliated language programs as
well as academic institutions in the United States, Canada and
Europe. Our focus is narrow. We will not be discussing language testing
in all of its forms, but only the testing of language proficiency:
an individual's demonstrable competence to use a language skill of
one type or another, regardless of how he may have acquired it.

In planning for the symposium we had four objectives in mind:
(1) to determine the state of the art of language proficiency testing
within the U.S. Government, (2) to discuss common problems relating
to language testing, (3) to explore new ideas and techniques for testing,
and (4) to establish a future direction for research and development.
We are not operating under the delusion that any of these objectives
will be completely met. It simply will not be possible to surface
and discuss all of the problems concerning language proficiency
testing, let alone find adequate solutions for them. Furthermore, we
realize that although we are dealing with an imperfect system, it may
not be possible to alter it a great deal under the circumstances. We will
simply have to learn to live with some of its imperfections. But we also
feel an obligation to review our program carefully and to attempt to
make improvements where it is possible to do so. We are optimistic
that new ideas will emerge from this forum which will aid all of us in
devising more accurate means of testing language proficiency.

The three skills which are most often tested at Government language
schools are speaking, listening comprehension and reading. You will
recall that the scores on our proficiency tests are supposed to in some
way reflect language competence as described by the Civil Service
definitions. In order to clarify the criteria for evaluation we are dealing
with, I will give the definitions for level 3, or the minimum professional
level, for each of the three skills:

The level "3" speaker should be: 

Able to speak the language with sufficient structural accuracy
and vocabulary to participate effectively in most formal and
informal conversations on practical, social, and professional
topics. Can discuss particular interests and special fields of
competence with reasonable ease; comprehension is quite
complete for a normal rate of speech; vocabulary is broad
enough that he rarely has to grope for a word; accent may be
obviously foreign; control of grammar good; errors never interfere
with understanding and rarely disturb the native
speaker.

In terms of listening comprehension, the individual at level "3" is: 

Able to understand the essentials of all speech in a standard
dialect, including technical discussions within a special
field. Has effective understanding of face-to-face speech,
delivered with normal clarity and speed in a standard dialect,
on general topics and areas of special interest; has broad
enough vocabulary that he rarely has to ask for paraphrasing
or explanation; can follow accurately the essentials of conversations
between educated native speakers, reasonably clear
telephone calls, radio broadcasts, and public addresses on
non-technical subjects; can understand without difficulty all
forms of standard speech concerning a special professional
field.

At the "3" level for reading proficiency, a person is:

Able to read standard newspaper items addressed to the general
reader, routine correspondence, reports and technical
material in his special field. Can grasp the essentials of
articles of the above types without using a dictionary; for
accurate understanding moderately frequent use of a dictionary
is required. Has occasional difficulty with unusually
complex structures and low-frequency idioms.

If these definitions are to be taken seriously, we must be satisfied
that anyone who is tested and assigned a proficiency rating can meet
the criteria for that level. One of the principal problems we are faced
with is the construction of proficiency tests which measure language
ability accurately enough to correspond to these definitions. At the
present time there are several kinds of language proficiency tests used
in the various agencies, i.e. different tests are used to measure the
same skill because of differing circumstances. In some cases we feel
confident that the correlation between the performance on the test and
the performance in a real-life situation is good. In other cases we are
less certain, mainly because no validation studies have been made
with the definitions as a basis.

Speaking proficiency is tested in a direct way at FSI and the CIA by
means of an Oral Interview Test. In spite of its drawbacks, this method
probably provides the most valid measurement of general speaking
proficiency currently available. Research which is now in progress
indicates that the reliability of the oral interview test is also very good.
But it has certain disadvantages with respect to its administration. It
is expensive and limited in that trained testers must be present to administer
it. There is often a need to test large populations or to give a
test at a location to which it would not be economically feasible to
send a testing team. What are the alternatives? There are several tests
of speaking proficiency now available which are not limited by these
restrictions, but unfortunately they do not provide a sufficiently adequate
measurement for our purposes. For example, most structured
oral language tests use a text, pictures or a recording as the stimulus.
The response of the examinee is limited and often unnatural. There is
little possibility for variation. It is somewhat similar to doing archaeological
field work by looking at black and white snapshots of the site.
You can get an idea, but you cannot explore. There is also the possibility
of inferring a speaking proficiency level on the basis of a listening
comprehension test, but we do not yet have convincing data to
show that a high enough correlation exists between the two types of
tests. We are still looking, and should continue to look, for alternate
means of testing speaking proficiency.

Because the requirements for language use differ from agency to
agency, the relative importance of testing certain skills also differs.
The testing of listening comprehension provides a good example. Within
the various language schools there are several kinds of listening
comprehension tests, including a number of standardized multiple-choice
tests of the type familiar to all of us. These tests provide the
desirable element of objectivity, but they are also open to some serious
questions. For example, is it really possible for a test with this format
to correspond in any way to the Civil Service definitions, which are
expressed in functional terms? A multiple-choice test can serve as an
indicator of proficiency, but until we can validate it against performance
based on the definitions, we do not know how accurate the indicator
is. Multiple-choice listening comprehension tests also have certain
inherent problems such as memory, distraction, double jeopardy
(if both the stimulus and alternatives are in the target language)
and mixed skills (i.e. the examinee may be able to understand the
stimulus, but may not be able to read the alternatives). It is possible
for an examinee to understand the target language quite well, yet
score low on a test because of other factors. There is, unfortunately,
no way to make a direct assessment of a person's ability to comprehend
a foreign language short of gaining access to his language perception
mechanism, whatever that is.

At FSI there is no requirement to distinguish between speaking and
listening comprehension, and thus the FSI S-rating is a combination of
both: comprehension is one of the factors on which the S-rating is
based. At the CIA a distinction is made between an S-rating and a U-rating
(U = understanding), but a separate listening comprehension
test is not given. In most cases the judgment about an examinee's comprehension
ability is made on the basis of his performance on the oral
interview. Such a method is potentially problematic if an examinee's
skill in understanding the language greatly exceeds his ability to speak
it. The level of language difficulty in the interview is necessarily dictated
by the examinee's speaking proficiency; thus his skill for understanding
what the examiner says may not be sufficiently challenged.
To correct this deficiency the CIA Language Learning Center is presently
experimenting with the use of taped passages as a part of the
oral interview. We have yet to overcome some problems in this regard,
not the least of which is the establishment of evaluation criteria.

All of the agencies have a requirement for testing reading proficiency.
At FSI, and in some cases at the CIA, the last ten to fifteen
minutes of the oral interview are spent in an oral translation exercise.
An approximation of the examinee's reading proficiency is made on
the basis of his speaking proficiency. He is then given a short passage
in the target language, often taken directly from a current newspaper
or magazine, which he reads and retells in English without the aid of
a dictionary. The passages are scaled according to the eleven levels of
proficiency, and the examinee must be able to give a good, accurate
rendering in order to receive the rating which corresponds to the level
of the passage. If the linguist feels that the passage was not appropriate
for the examinee, he can choose a second one of greater or lesser
difficulty. In a typical test three to four passages are read. This
method has the advantage of being easy to administer. It is also a relatively
simple matter to change the test by changing the passages,
provided they are properly scaled. Unfortunately, there has never
been a reliability study made of this test. Furthermore, in spite of the
directness of oral translation in comparison to a multiple-choice test,
it cannot yet be assumed that the examinee's performance in translating
correlates directly with his ability to read and comprehend
written material in the target language. Again, it will be necessary to
make an exhaustive validity study before we can be assured that it
does, in fact, provide an accurate measure of reading proficiency.

Multiple-choice reading proficiency tests are used on a regular basis
at DLI and the CIA Language Learning Center. The objectivity and reliability
provided by these standardized tests are desirable indeed, but
the disadvantages must also be acknowledged. In our case, we have
had only one form for each language for more than ten years. Obviously
some employees have taken the test more than once, sometimes
within a relatively short period of time. Validity in such cases
is, of course, questionable. For this reason we are in the process of
devising a new testing model which we feel is a more valid measurement
of reading proficiency, and for which we plan to make multiple
forms.

It may sound like heresy to some ears, but in all agencies translation
tests are used in certain cases for measuring reading proficiency.
However, we really have no empirical evidence about the validity, or
lack of it, of such tests. The main administrative problem with this
type of test is scoring. It must be done manually, and with so many
possibilities for mistakes of differing magnitude it is difficult to devise
a reliable method of scoring. The use of translation as a testing device
should not, however, be discarded.

Within the Government language community the greatest amount of
research in the area of reading proficiency testing has been done at
DLI's Systems Development Agency in Monterey, California. Here a
team of linguists and psychometricians is working on many of the
problems of testing, especially test validity. They are also charged
with the awesome responsibility of developing listening comprehension
and reading tests for more than fifty languages, so a practical balance
between research and development has to be maintained. Other
Defense Department language programs are also occupied with the
challenge of developing new kinds of reading tests and have discovered
some novel, interesting techniques of getting at the problem.

The General Accounting Office (GAO) "Report to the Congress
on the Need to Improve Language Training Programs and Assignments
for U.S. Government Personnel Overseas" discusses some of the problems
of testing language proficiency in U.S. Government agencies and
suggests that research and development of language tests be coordinated
among the agencies. We cannot know how effective our language
training programs are or how valid our mechanism for assigning
personnel to language-essential positions is unless we are confident
that our testing programs provide an accurate measurement of language
proficiency. We are reasonably satisfied that our present system
works, but we should not be completely content with it, as there is still
much to be done.

The U.S. Government language community has had vast experience
with language testing: each year more than seven thousand people
are tested in approximately sixty different languages. The range of
proficiency covers the entire spectrum, all the way from the beginner
to those who have a command of the language equivalent to that of an
educated native speaker. A large amount of data is thus generated
which can be of value not only for our purposes, but for anyone interested
in language testing. A cooperative effort on the part of Government
and non-Government language interests would therefore be
of great mutual benefit.

Since the Government has such a large stake in improved testing,
should we not dedicate a greater portion of our resources to research,
in order to learn more about the tests we are presently using, as well
as to experiment with new techniques? Perhaps this symposium will
be the stimulus to initiate a comprehensive program of evaluation, research
and development in language proficiency testing.






DISCUSSION 

Lado: I think the paper was very helpful in giving us a broad presentation of
many of the issues that interest us. I do not agree that the interview is more
natural than some of the other forms of tests, because if I'm being interviewed
and I know that my salary and my promotion depend on it, no matter
how charming the interviewer and his assistants are, this couldn't be any
more unnatural. I would also argue against considering Civil Service definitions
as dogma. In my view they can be changed, and better definitions can be
found. One further point. "We shouldn't discard translation," we were told
a couple of times. I would like to discard translation, especially as a test of
reading.

Jones: Any test is unnatural and is going to create anxiety, especially if one's
salary or grade depends on it. As a matter of fact, just speaking a foreign language
in a real-life situation can cause anxiety. As to the Civil Service definitions,
they are not dogma, and they may well be changed. Finally, translation,
as is the case with all reading tests, is one indirect measure of a person's
ability to understand written language. It has its drawbacks, but it also has
its merits.

Nickel: There is certainly a revival, a renaissance, in the interest in translation
now taking place in Europe. In some work we have done we seem to see
a certain correlation between a skill like speaking and translation, and I feel
that translation tests are useful for skills other than translating.

Davies: You mention speaking, listening and reading level 3. I'd like to
know whether level 3 for reading is supposed to be equivalent in some way
to level 3 for listening. It seems to me that as you read them through very
quickly, they mean very different things.

Jones: As far as the structure of the language is concerned, they should. It
should not be inferred that a level 3 understander (aural) is also a level 3
reader.

Davies: In talking about reading, you said that "the passages are scaled
according to the eleven levels of proficiency." How were they scaled?

Wilds: Perhaps I can answer that question, although it began so long ago
that it's hard to say how they were scaled initially. Since the beginning, new
passages have been matched with old ones so that they are proven to be in
an order of difficulty which seems to hold true for everybody who takes the
test. A passage that is graded 3+, for example, will not be given that final
grade until it is shown by several dozen examinees to match the performance
on accepted 3+ passages. I might say that there are no 0 or 0+ passages as far
as I know, so there are really only 9 levels. And in many languages where
there aren't many tests, there are no plus ratings on the passages. You need
a great many examinees to make it finer than that in gradation.

Quiñones: I'd like to add that this decision was certainly not based on
the definitions, although, at least in our case, we looked at them when we
were scaling the passages. Also, we made an attempt to look at the passages
from the point of view of frequency of words and complexity of sentences.
But ultimately it was a subjective decision by the test constructors, and the
ultimate decision for keeping the passage was based on the experience of
using the passages and having people at different levels handling them.

Sake: In your presentation you mentioned that the CIA was experimenting
with taped passages. How far along are you on this experiment, and do you
foresee an instrument that is as reliable and valid as the one you are now
using? And if so, do you think there will be a substantial savings in rating
people?

Jones: Our primary concern with giving a rating for listening comprehension
on the basis of the oral interview is that if the person tested is able to understand
the language very well, but for some reason is deficient in speaking, it
is very likely that he will get a low rating for listening comprehension. There
is no way for him to demonstrate that his ability to understand exceeds (in
some cases by as much as two levels) his ability to speak. So our experimentation
in this respect is primarily to find out whether, on the basis of the taped
passages, a person might be able to understand better than is evident from the
interview. We have a fairly good idea of his minimal level; the taped passages,
we hope, will bring out anything that exceeds that. Our primary problem is,
once again, trying to line the passages up with our levels.

Hindmarsh: Is there a definition of writing proficiency?

Jones: Yes, there is. We rarely test writing, however, and our research and
development projects are not currently concerned with any type of writing
proficiency test.

Frey: I'd like to ask about the expense involved in oral testing. We found, of
course, that it's very expensive. I wonder how long your tests take, and how
expensive they are?








Jones: I couldn't really quote a dollar figure, but Jim Frith quotes a figure of
$35.00 for a test of speaking and reading. It's a very expensive type of test
because we have two testers with the examinee for a period of anywhere
from 15 minutes to more than a half hour, depending on the level. A person
who comes in with a 0+ level doesn't take long to test. However, if a person
is up in the 4 or 4+ range, we have to take a lot more time to explore and
find out where the border really is. We feel, however, that whatever the expense
is, it's worth it. We have to have this kind of a test to be able to find out
what a person's ability to speak really is. While it would be possible to use
taped tests, if you have to take time to listen to the tape anyway, why not do
it face to face in the first place?







Theoretical and Technical Considerations in Oral 
Proficiency Testing 

John L. D. Clark



The intent of this paper is to identify and discuss some of the major
theoretical and practical considerations in the development and use
of oral proficiency tests. A few definitions are required in order to
identify and delineate the area of discussion. A proficiency test is
considered as any measurement procedure aimed at determining the
examinee's ability to receive or transmit information in the test
language for some pragmatically useful purpose within a real-life setting.
For example, a test of the student's ability to comprehend various
types of radio broadcasts or to understand the dialogue of a foreign
language film would be considered a proficiency test in listening
comprehension. A proficiency test in the area of written production
would involve measuring the student's ability to produce such written
documents as notes to the plumber, informal letters to acquaintances,
and various types of business correspondence. In all cases, the
emphasis in proficiency testing is on determining the student's ability to
operate effectively in real-life language use situations.

In the testing of oral proficiency, possible real-life contexts include
such activities as reading aloud (as in giving a prepared speech),
dictating into a tape recorder, talking on the telephone, and conversing
face-to-face with one or more interlocutors. In terms of the relative
frequency of these speaking activities, face-to-face conversation is
definitely the most highly preponderant, and with some justification,
the term "oral proficiency" is usually thought of in terms of a
conversational situation.

A further distinction is necessary between two major subcategories
of proficiency testing, direct and indirect. In direct proficiency
testing, the testing format and procedure attempt to duplicate as closely
as possible the setting and operation of the real-life situations in
which the proficiency is normally demonstrated. For example, a
direct proficiency test of listening comprehension might involve the
presentation of taped radio broadcasts, complete with the static and
somewhat limited frequency range typical of actual radio reception.
A direct proficiency test of reading comprehension would involve the
use of verbatim magazine articles, newspaper reports, and other texts





actually encountered in real-life reading situations. A direct test of
oral proficiency, in the face-to-face communication sense, would
involve a test setting in which the examinee and one or more human
interlocutors do, in fact, engage in communicative dialogue. A major
requirement of direct proficiency tests is that they must provide a
very close facsimile or "work sample" of the real-life language
situations in question, with respect to both the setting and operation of
the tests and the linguistic areas and content which they embody.

Indirect proficiency tests, on the other hand, do not require the
establishment of a highly face-valid and representative testing
situation. In some cases, of course, an indirect test may involve certain
quasi-realistic activities on the student's part. For example, in the
speaking area, a test which is defined here as indirect may require the
student to describe printed pictures aloud or in some other way
produce intelligible spoken responses. However, since such testing
procedures are not truly reflective of a real-life dialogue situation,
they are considered indirect rather than direct measures of oral
proficiency.

Other indirect techniques may have virtually no formal
correspondence to real-life language activities. One example is the
so-called "cloze" technique, in which the examinee is asked to resupply
letters or words that have been systematically deleted from a
continuous text. This specific behavior would rarely if ever be called
for in real-life situations.

The validity of these and other indirect procedures as measures of
real-life proficiency is established through statistical (specifically,
correlational) means. If and when a given indirect test is found to
correlate highly and consistently with more direct tests of the
proficiency in question, it becomes useful as a surrogate measure of that
proficiency, in the sense that it permits reasonably accurate
predictions of the level of performance that the student would demonstrate
if he were to undergo the more direct test. This type of correlational
validity is usually referred to as congruent or concurrent validity.
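
As a rough illustration of how such a correlational check might be run, the
sketch below computes a Pearson product-moment coefficient between scores on
an indirect test and score levels from a direct test. The function name and
the six pairs of scores are hypothetical and serve only to show the
arithmetic; they are not data from any study cited here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one score list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six examinees (illustrative only, not real data).
indirect_scores = [12, 18, 25, 31, 40, 44]  # e.g. raw scores on an indirect test
direct_levels = [1, 1, 2, 3, 3, 4]          # e.g. score levels on a direct test

r = pearson_r(indirect_scores, direct_levels)
print(f"concurrent validity estimate: r = {r:.2f}")
```

A coefficient near 1 would support using the indirect test as a surrogate;
how high is high enough remains a judgment the text leaves to the test user.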

In addition to being either face/content-valid or concurrently-valid,
as required, direct and indirect proficiency tests must also be reliable,
in the sense that they must provide consistent, replicable information
about student performance. If no intervening learning has taken
place, a given student would be expected to receive approximately
the same score on a number of different administrations of the same
test or alternate forms thereof. If, however, test scores are found to
vary appreciably through influences other than changes in student
ability, test unreliability is indicated, and the measure accordingly
becomes less appropriate as a true measure of student performance.
Finally, both direct and indirect proficiency tests must have a







satisfactory degree of practicality. No matter how highly valid and
reliable a particular testing method may be, it cannot be serviceable
for "real-world" applications unless it falls within acceptable limits
of cost, manpower requirements, and time constraints for
administration and scoring. To overlook or minimize these aspects when
planning and developing testing procedures is to court serious
disillusionment when the procedures go through the trial-by-fire of
operational use.

We have so far defined the area of "oral proficiency testing";
identified direct and indirect techniques within this area; and outlined the
three major considerations of validity, reliability, and practicality as
touchstones for a more detailed analysis of specific testing
procedures. In conducting this analysis, it will also be helpful to present a
brief taxonomy of theoretically possible testing procedures and
identify the possible procedures which most adequately fulfill the
validity, reliability, and practicality criteria that have been discussed.

Two major components of any testing procedure are administration
and scoring. Administration is the process by which test stimuli
are presented to the examinee. "Mechanical" administration refers to
procedures in which test booklets, tape recorders, videotapes, or
other inanimate devices are for all practical purposes entirely
responsible for test administration. Any input by a "live" examiner is
restricted to peripheral matters such as giving general directions and
handing out test materials. "Human" administration, on the other
hand, requires the presence of a live examiner who is actively and
continuously involved in the testing process: reading test questions
aloud, conversing with the student in an interview situation, and so
forth.

Test scoring is the process by which the student's responses to the
test stimuli are converted to numerical data or numerically codeable
data such as the scoring levels of the FSI-type interview. The scoring
process can also be either "mechanical" or "human." In "mechanical"
scoring, student responses are converted automatically, i.e. without
any thought or judgment on the part of a human rater, to the
appropriate score. This would include the scoring of multiple-choice
responses, either by machine or by a human performing the same
mechanical chore, and also the automatic evaluation of spoken
responses through voice recognition devices or similar electronic
means. In "human" scoring, one or more persons must actually listen
to the responses of the examinee and exercise a certain degree of
thought or judgment in arriving at a rating of the examinee's
performance.

Test scoring, both mechanical and human, can be further divided
into "simultaneous" scoring and "delayed" scoring. Simultaneous


scoring is carried out on the spot, either during or immediately
following the test itself, and there is no need to tape record or in any
other way preserve the examinee's responses. In delayed scoring, the
test responses of the examinee are recorded for evaluation at a later
time.

Table 1 below summarizes possible combinations of administration
technique (mechanical/human), scoring technique (mechanical/
human), and time of scoring (simultaneous/delayed), and gives
examples of actual tests or theoretically possible tests based on these
combinations.



Table 1

An Inventory of Possible Administration and Scoring Modes
for Oral Proficiency Testing

    Administration  Scoring     Time of Scoring  Examples

1.  Mechanical      Mechanical  Simul.           Speech Auto-Instructional Device
                                                 (Buiten and Lane 1965); SCOPE
                                                 Speech Interpreter (Pulliam 1969).

2.  Mechanical      Mechanical  Delayed          As in (1), using previously recorded
                                                 responses.

3.  Mechanical      Human       Simul.           Test administration via tape recorder
                                                 and/or visual stimuli; human scorer
                                                 evaluates responses on-the-spot.

4.  Mechanical      Human       Delayed          Tape recorded speaking tests in
                                                 typical achievement batteries (MLA-
                                                 Cooperative Tests, MLA Proficiency
                                                 Tests for Teachers and Advanced
                                                 Students).

5.  Human           Mechanical  Simul.           Unlikely procedure.

6.  Human           Mechanical  Delayed          Unlikely procedure.

7.  Human           Human       Simul.           Face-to-face interviews (FSI; Peace
                                                 Corps/ETS).

8.  Human           Human       Delayed          As in (7), using previously recorded
                                                 responses.
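
The eight categories of Table 1 are the full cross of three two-valued
factors. As a small sketch (the variable names are illustrative choices, not
terms from the paper), the numbering can be regenerated and then filtered down
to the human-administered, human-scored categories that the discussion of
direct tests below arrives at.

```python
from itertools import product

ADMINISTRATION = ("Mechanical", "Human")
SCORING = ("Mechanical", "Human")
TIMING = ("Simultaneous", "Delayed")

# product() varies the last factor fastest, which reproduces the
# category order used in Table 1 (1 = Mechanical/Mechanical/Simultaneous).
categories = {
    number: combo
    for number, combo in enumerate(product(ADMINISTRATION, SCORING, TIMING), start=1)
}

for number, (administration, scoring, timing) in categories.items():
    print(f"{number}. {administration:<10} {scoring:<10} {timing}")

# Categories that combine human administration with human scoring.
direct_candidates = [
    number
    for number, (administration, scoring, _timing) in categories.items()
    if administration == "Human" and scoring == "Human"
]
print("human-administered, human-scored:", direct_candidates)
```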



To discuss first the area of direct oral proficiency tests, the possible 
combinations of administration and scoring procedures are highly 
restricted by the need to provide a valid facsimile of the actual com- 
municative situations. Since the instantaneous modification of topical 
content characteristic of real-life conversational situations cannot be 
duplicated through tape records or other mechanical means, "human" 
administration is required. This restricts the available possibilities 
to categories 5 through 8 in Table 1. Of these, human administration 






and mechanical scoring (categories 5 and 6) would involve the use of 
some type of device capable of analyzing complex conversational 
speech. At the present time, no such device is available. 

The remaining categories are 7 and 8. Category 7 (human
administration and simultaneous human scoring) is exemplified by the
face-to-face interview of the FSI type,¹ in which one or more trained
individuals administer the test stimuli (in the sense of holding a guided
conversation with the examinee) and also evaluate the student's
performance on a real-time basis. Category 8 (human administration and
delayed human scoring) would also involve a face-to-face
conversation, but the scoring would be carried out at a later time using a tape
recording of the interview or a videotape with a sound track.

From the standpoint of validity, tests in categories 7 and 8 approach
real-life communication about as closely as is possible in the test
situation. Face-to-face conversation between examiner and examinee
on a variety of topics does, of course, differ to some extent from the
contexts in which these communications take place in real life, and
the psychological and affective components of the formal interview
also differ somewhat from those of the real-life setting. As Perren
points out: ". . . both participants know perfectly well that it is a test
and not a tea-party, and both are subject to psychological tensions,
and what is more important, to linguistic constraints of style and
register thought appropriate to the occasion by both participants."²
However, except for such exotic and ultimately impractical
techniques as surreptitiously observing the examinee in real-life linguistic
settings (ordering meals, talking with friends, communicating on the
job, and so forth), it is difficult to identify an oral proficiency
measurement technique with a usefully higher level of face validity.

With respect to the reliability of the interview procedure, it can be
asked whether simultaneous or delayed evaluation of the interview
permits more reliable scoring. In connection with an interviewer
training project which Educational Testing Service has been
conducting with ACTION/Peace Corps, 80 FSI-type interviews in French were
independently scored by two raters simultaneously present at the
interview, and their ratings agreed as to basic score level (0, 1, 2, 3, 4,
5) in 95 percent of the cases. Scoring of tape recorded interviews by
two or more independent raters (i.e. the "delayed" technique) has
informally been observed to attain about the same levels of
reliability, but much more detailed scoring reliability studies would be
desirable for both modes of scoring.
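
The agreement figure quoted above is a simple proportion: 95 percent of 80
interviews corresponds to 76 interviews on which both raters assigned the same
basic level. A minimal sketch of that calculation follows; the function name
and the ten pairs of ratings are hypothetical and are not the ETS data.

```python
def exact_agreement(ratings_a, ratings_b):
    """Proportion of examinees given the same score level by both raters."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    matches = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
    return matches / len(ratings_a)


# Hypothetical basic score levels (0-5) for ten interviews; illustrative only.
rater_1 = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5]
rater_2 = [0, 1, 2, 2, 2, 3, 3, 4, 4, 5]

print(f"exact agreement: {exact_agreement(rater_1, rater_2):.0%}")
```

Exact agreement ignores how far apart two disagreeing ratings are, which is
one reason the more detailed reliability studies called for above would be
useful.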

Certain attributes of the simultaneous scoring procedure could be 



¹Rice 1959; Foreign Service Institute 1963.
²Perren 1967, p. 26.




viewed as more favorable to high scoring reliability than the delayed
procedure. First, all relevant communicative stimuli are available to
the scorer, including the examinee's facial expressions, gestures, lip
movements, and so forth. Unless a video recording of the interview
is made (rather than an ordinary tape recording), these components
would be lost to the rater in the delayed scoring situation. Second,
simultaneous scoring may benefit from a "recency of exposure"
factor in that the rater has the conversation more clearly and more
thoroughly in mind than he or any other scorer could have at a later
time. Third, when the test administrator and scorer are present
simultaneously (or when a single interviewer fills both roles), the interview
can be lengthened or modified in certain ways which the scorer
considers important to a comprehensive assessment of the candidate's
performance. In delayed scoring, the rater must base his judgment on
whatever is recorded on the tape, and he has no corrective recourse
if the interview happens to be too brief or otherwise unsatisfactory
for effective scoring. Finally, when the interview is scored on the
spot, there is no possibility of encountering technical difficulties such
as poorly recorded or otherwise distorted tapes that might hinder
accurate scoring in the delayed situation.

On the other hand, there are a number of features of the delayed
scoring arrangement that might be considered to enhance scoring
reliability. First, there would be no opportunity for variables such as
the interviewee's mannerisms or personal attractiveness to affect the
scoring process. Second, there could be a better control on the
scoring conditions, in that the interview tapes could be more effectively
randomized, intermingled with tapes from other sources, and so
forth than is usually the case when live examinees must be scheduled
at a given testing site. Third, delayed scoring would allow for
repetitive playback of all or selected portions of the interview to resolve
points of doubt in the scorer's mind, a possibility which is not
available in the simultaneous scoring situation.

In view of these and other conflicting interpretations of the poten- 
tial reliabilities of simultaneous and delayed techniques, a compre- 
hensive experimental study comparing these two procedures would 
seem very much in order. 

With respect to the practicality of interview testing of the FSI type,
an obvious concern is the need to involve expensive humans in both
the test administration and scoring process. Since there appears to be
no alternative to such an approach, at least within the context of
direct proficiency testing, the question is reduced to that of making
the most effective use of the human input required.

The manpower requirements can be reduced to a considerable
extent by decreasing the total testing time per examinee. Interview tests




of the FSI type typically require approximately 15 to 30 minutes,
with somewhat shorter or longer testing times for very limited or
extremely proficient examinees, respectively. Evaluation of the
student's performance and assignment of a score level would usually
require an additional 2 to 5 minutes beyond the running time of the
interview itself. When interviewing on a group basis, it is difficult for
a single tester or team of testers to administer more than about 15
interviews per day.

Since test administration time and the associated manpower
expense is probably the largest single drawback to widespread use of
the full-scale interview procedure, there would be considerable
interest in determining the extent to which a face-to-face interview
could be abbreviated without seriously affecting either the validity
of the test or its scoring reliability. Considerable informal experience
in connection with the Peace Corps testing project suggests that the
examinee's basic score level (i.e. his assignment to one of the six
verbally-defined score levels) can be fairly accurately established
within the first 5 minutes of conversation. If evaluation at this level of
specificity is considered acceptable (as distinguished from the
detailed diagnostic information and assignment of applicable "plus"
levels obtained in a full-length interview), test administration and
scoring expense would be reduced by a factor of three or four.

Although shorter interview times do reduce the number of topical
areas and styles of discourse that can be sampled, the effect on
scoring reliability may not be so great as has commonly been
assumed. In any event, the matter of optimum interview length is a
strictly empirical question which should be thoroughly explored in
a controlled experimental setting. An appropriate technique would
be to have a large number of trained raters present at a given
interview. At the end of fixed time intervals (such as every 5 minutes),
subgroups of these raters would leave the interview room and assign
ratings on the basis of the interview performance up to that time.
These ratings would be checked for reliability against the ratings
derived from partial interviews of other lengths and from the
full-length "criterion" interview.
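
One way the proposed comparison might be tabulated is sketched below. It
assumes that each departing subgroup's ratings are compared with a single
full-length criterion rating per examinee and that agreement means assigning
the identical level; both choices, along with the function name and all of the
data, are illustrative assumptions rather than part of the design described
above.

```python
def agreement_with_criterion(partial_ratings, criterion):
    """Share of partial-interview ratings matching the criterion, per elapsed time."""
    results = {}
    for minutes, ratings_by_examinee in sorted(partial_ratings.items()):
        total = 0
        matches = 0
        for examinee, ratings in ratings_by_examinee.items():
            for rating in ratings:
                total += 1
                matches += int(rating == criterion[examinee])
        if total == 0:
            raise ValueError(f"no ratings recorded at {minutes} minutes")
        results[minutes] = matches / total
    return results


# Hypothetical criterion levels from full-length interviews of three examinees,
# and ratings from subgroups of raters who left after 5, 10 and 15 minutes.
criterion = {"A": 2, "B": 3, "C": 1}
partial_ratings = {
    5: {"A": [2, 1], "B": [3, 2], "C": [1, 1]},
    10: {"A": [2, 2], "B": [3, 2], "C": [1, 1]},
    15: {"A": [2, 2], "B": [3, 3], "C": [1, 1]},
}

by_length = agreement_with_criterion(partial_ratings, criterion)
for minutes, share in by_length.items():
    print(f"{minutes:>2} min: {share:.2f} agreement with criterion")
```

Plotting such shares against interview length would show where added minutes
stop improving agreement, which is the empirical question the paragraph poses.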

A second major component of interview practicality is the question
of using 1 or 2 interviewers. The traditional FSI technique has been
to use 2 trained interviewers wherever possible. One interviewer
takes primary responsibility for leading the conversation, and the
other carefully listens for and makes notes of areas of strength and
weakness in the examinee's performance. The second interviewer
may also intervene from time to time to steer the conversation into
areas which the first interviewer may have overlooked. At the
conclusion of the interview, both examiners discuss the student's
performance and mutually determine the score level to be assigned. The
chief disadvantage of the two-examiner technique is the increased
manpower cost, which is effectively double that of the
single-examiner procedure. Again, detailed comparative studies would be
necessary to determine whether the participation of a second interviewer
results in a substantial and economically-justifiable increase in
scoring reliability.

In analyzing simultaneous and delayed interview scoring
techniques from the standpoint of practicality, the simultaneous
procedure appears clearly preferable. Indeed, simultaneous scoring can be
considered almost "free of charge" in the sense that the examiner(s),
already necessarily on hand to administer the interview, require
only a few additional moments to determine the appropriate score
level. By contrast, delayed scoring requires the complete "replaying"
of the interview, and although certain procedures such as time
compression of the tape recording (Cartier 1968) or preliminary editing
of several interviews into a single continuous tape (Rude 1967) might
decrease the scoring time somewhat, it is doubtful that delayed
scoring could ever be made as economical as simultaneous scoring carried
out by the test administrators themselves. A further disadvantage of
the delayed scoring technique is the appreciably longer turnaround
time for score reports to students and instructors.

The preceding discussion of direct proficiency measurement
techniques may be summarized as follows. The need to provide a
face-valid communicative setting restricts test administration possibilities
to the face-to-face interaction of a human tester and examinee.
Because mechanical devices capable of evaluating speech in a
conversational situation are not a viable possibility at the present time, the
scoring of the test must also involve trained human participation.
Within these constraints, the possibilities of selection among the eight
testing categories shown are reduced to a choice between
simultaneous and delayed scoring. The relative levels of reliability
obtainable through simultaneous and delayed scoring have not been
established on any rigorous basis, and logical arguments can be advanced
in favor of both techniques. Considerations of practicality point to
simultaneous scoring of the proficiency interview as an appreciably
more efficient and economical technique.

Turning now to indirect measures of oral proficiency, the testing
possibilities are expanded in that there is no longer a requirement
for a face-valid (i.e. human-administered, conversational)
administration setting, and mechanical administration techniques can be
considered. With reference to Table 1, the first two categories of
mechanical administration and mechanical scoring would involve
such techniques as the student's imitation of isolated sounds or short







phrases in the test language, with the responses evaluated by
computer-based speech recognition devices. Buiten and Lane (1965)
developed a Speech Auto-Instructional Device capable of extracting pitch,
loudness, and rhythm parameters from short spoken phrases and
comparing these to internally-stored criteria of accuracy. Pulliam
(1969) has described the development of an experimental speech
interpreter, also computer-based, which can evaluate the examinee's
pronunciation of specific short utterances. Drawbacks to the use of
these devices include equipment cost and complexity and also the
extremely limited repertoire of sounds or phrases that can be
evaluated with a single programming of the machines. It is also quite
doubtful that even the very precise measurement of the student's
pronunciation accuracy that might be afforded by these devices
would show a high correlation with general proficiency, in view of
the many other variables which are involved in the latter
performance.

Category 3 (mechanical test administration and simultaneous
human scoring) does not appear to be productive. One possible
application would be the tape recorded presentation of questions or other
stimuli to which the examinee would respond, with on-the-spot
evaluation by a human rater. Such a technique would, however, afford no
saving in manpower over a regular face-to-face interview, and there
would seem to be no practical reason to prefer it over the latter, more
direct, technique as a means of overall proficiency testing.

Category 4 (mechanical administration and delayed human scoring)
offers considerably greater testing possibilities. Included in this
category are the speaking tests in large-scale standardized batteries
such as the MLA Foreign Language Proficiency Tests for Teachers and
Advanced Students (Starr 1962); the MLA-Cooperative Foreign
Language Tests (Educational Testing Service 1965); and the Pimsleur
Proficiency Tests (Pimsleur 1967). The general technique in these and
similar tests is to coordinate a master tape recording and student
booklet in such a way that both aural stimuli (such as short phrases
to be mimicked, questions to which the student responds) and visual
stimuli (printed texts to be read aloud, pictures to be described, etc.)
can be presented. The master tape also gives the test instructions and
paces the student through the various parts of the test.

It is fairly well established that the types of speaking tasks
presented in a standardized speaking test cannot be considered highly
face-valid measures of the student's communicative proficiency. As
previously indicated, the most serious drawback in this respect is that
it is not possible to engineer a mechanically-administered test in such
a way that the stimulus questions can be changed or modified on a
real-time basis to correspond to the give-and-take of real-life
communication. In addition to this basic difficulty, a substantial
proportion of the specific testing formats used in these tests (mimicry of
heard phrases, descriptions of pictures or series of pictures, reading
aloud from a printed text) are at least some steps removed from the
face-to-face conversational interaction implicit in the concept of oral
proficiency. For these reasons, it appears more appropriate and more
productive to classify and interpret the MLA Proficiency Tests, the
MLA-Cooperative Tests, and similar instruments as indirect
measures of oral proficiency which reveal their appropriateness as
proficiency measures not through the observed validity of their setting,
content, and operation but through the degree to which they may be
found to correlate on a concurrent basis with direct measures of
oral proficiency.

Unfortunately, the detailed correlational studies needed to
establish the concurrent validity of these indirect measures vis-a-vis direct
proficiency tests are for the most part lacking. In connection with a
large-scale survey of the foreign language proficiency of graduating
college language majors, Carroll (1967) administered both the
speaking test from the MLA Proficiency Battery and the FSI interview test
to small samples of students of French, German, Russian, and
Spanish. Correlations ranging from .66 to .82 were obtained, representing
moderate to good predictive accuracy. To the extent that scoring of
the indirect speaking tests is itself an unreliable process, the observed
correlations between these tests and the FSI interview or similar
direct procedures would be attenuated.
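
The attenuation mentioned here is usually expressed through Spearman's
classical relation, in which the observed correlation equals the correlation
between true scores multiplied by the square root of the product of the two
reliabilities. The sketch below applies that standard formula; the paper
itself does not carry out this computation, and the function names and numbers
are hypothetical.

```python
from math import sqrt


def _check(reliability):
    """Reliabilities must lie in the interval (0, 1]."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be greater than 0 and at most 1")


def attenuated_r(true_r, reliability_x, reliability_y=1.0):
    """Observed correlation expected when both measures are imperfectly reliable."""
    _check(reliability_x)
    _check(reliability_y)
    return true_r * sqrt(reliability_x * reliability_y)


def disattenuated_r(observed_r, reliability_x, reliability_y=1.0):
    """Spearman's correction: estimated correlation between true scores."""
    _check(reliability_x)
    _check(reliability_y)
    return observed_r / sqrt(reliability_x * reliability_y)


# Hypothetical values: a true-score correlation of .90 and an indirect test
# whose scoring reliability is .64; the direct test is treated as perfectly
# reliable purely to keep the example simple.
observed = attenuated_r(0.90, 0.64)
print(f"expected observed correlation: {observed:.2f}")
print(f"corrected back: {disattenuated_r(observed, 0.64):.2f}")
```

The correction gives only an estimate, and with low reliabilities or small
samples it can exceed 1, so corrected values should be read with caution.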

It is interesting to note that standardized speaking tests of the MLA
type are generally considered to have higher scoring reliabilities than
the freer and less structured interview techniques. This opinion may
be attributable in part to the impressive technical accouterments of
the standardized tests, including the language laboratory
administration setting and the accompanying master test tapes, student booklets,
and response tapes. However, evidence available to date does not
support a high level of scoring reliability for tests of this type.

Starr (1962) has discussed some of the difficulties encountered in
the scoring of the MLA Proficiency Speaking Tests, including a "halo
effect" when a single rater was required to score all sections of a
given test tape and the gradual shifting of scoring standards in the
course of the grading process. Scoring reliability of the
MLA-Cooperative Speaking Tests was examined in a study of the two-rater
scoring of 100 French test tapes (Educational Testing Service 1965).
Among the different test sections, scoring reliability ranged from .78
(for the picture description section) to a low of .31 (mimicry of short
phrases). The inter-rater reliability for the entire test was only .51.
Scoring reliability for the Pimsleur speaking tests was not reported


in the test manual, and Pimsleur indicated that "because of the nature
of the test," the speaking test scores should be interpreted with
caution.³

These results raise an interesting question: specifically, whether
carefully designed direct proficiency interviews might not exceed in
scoring reliability the levels so far observed for the more indirect
standardized tests. Additional studies of the scoring reliabilities of
both types of test would seem very much in order.

In regard to the question of practicality, mechanically-administered
speaking tests do save administration time in that a number of
students can be tested simultaneously in a language laboratory setting.
However, during the scoring process each student response tape must
still be evaluated individually by human listeners, and to the extent
that the scoring time for the indirect recorded test approaches the
combined administration/scoring time of the direct proficiency
interview, any manpower advantage of the tape recorded procedure is
lost.

With regard to typical scoring times for tape recorded tests, it is
interesting to note that scorers evaluating the MLA Proficiency Test
tapes on a volume basis were typically able to score approximately
15 tapes per day. It bears emphasizing that this rate is not appreciably
different from the number of face-to-face interviews of the FSI type
that a single individual can conveniently administer and score in a
working day.

Widely varying scoring rates have been reported for other types of
tape recorded speaking tests. These range from a maximum of about
1 hour per student to a minimum of about 5 minutes. The one-hour
figure is reported by Davison and Geake (1970), who evaluated each
student's responses according to a number of detailed criteria. The
procedure also included frequent reference to external comparison
tapes and considerable replaying of the student tapes. The five-minute
scoring was accomplished by Beardsmore and Renkin (1971), using a
shorter initial test and a tape recording technique which deleted from
the student tapes all material other than the active responses.

Generally speaking, the scoring time for tape recorded tests is
affected by a great number of factors, including the absolute length of
the student's responses, the presence or absence of "dead" spaces in
which test directions or stimuli are being heard instead of student
responses, the frequency with which portions of the test must be
replayed during scoring, the complexity of the scoring procedure itself,
the amount of time required to mark down partial scores and
calculate a total score, and even the rewind speed of the machines on


³Pimsleur 1967, p. 15.







which the test tapes are played back. In the ideal situation, a
combination of carefully planned test formats, technological aids such as
voice-activated relays to operate the student recorders only during
active responding, and concise and easily-applied scoring standards
would reduce test scoring time considerably while providing for a
sufficiently broad sampling of the student's speaking performance.
On the other hand, lack of care in developing the test formats,
administration procedures, and scoring techniques may well result in an
indirect test of oral proficiency which is appreciably less
cost-effective in terms of administration and scoring manpower than the direct
proficiency interview itself.

All of the indirect tests discussed so far require active speech
production on the student's part, even though the speaking tasks involved
are not closely parallel to real-life communication activities. Although
such tests may be felt to have a certain degree of face validity in the
sense that the student is actually required to speak in a variety of
stimulus situations, their true value as effective measures of
communicative proficiency is more appropriately established on a
concurrent validity basis, i.e. through statistical correlation with an
FSI-type interview or other criterion test that is in itself highly face-valid.
There is a second category of indirect tests in which the student is 
not even required to speak. Tests of this type must depend even more 
highly on correlational relationships with direct criterion tests to 
establish their validity as measures of oral proficiency. 

Among these "non-speaking" speaking tests, the "reduced redundancy"
technique developed by Bernard Spolsky is discussed at
length elsewhere in this volume. Briefly, the reduced redundancy
procedure involves giving the student a number of sentences in the
target language which have been distorted by the introduction of
white noise at various signal/noise levels. The student attempts to
write out each sentence as it is heard. On the assumption that
students who have a high degree of overall proficiency in the language
can continue to understand the recorded sentences even when many
of the redundant linguistic cues available in the undistorted sentence
have been obliterated, the student's score on the test is considered
indicative of his general level of language proficiency.
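
The distortion step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the studies cited: the function name `add_white_noise`, the Gaussian noise model, the decibel levels, and the sine tone standing in for a recorded sentence are all hypothetical choices made for the example.

```python
import math
import random


def add_white_noise(signal, snr_db, rng=None):
    """Return a copy of `signal` with white (Gaussian) noise mixed in.

    snr_db is the desired signal-to-noise ratio in decibels; lower values
    obliterate more of the redundant cues in the recording.
    """
    rng = rng or random.Random()
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


# A one-second 440 Hz tone stands in for a recorded sentence (assumption).
rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]

# One distorted version of the same item per signal/noise level.
versions = {snr: add_white_noise(tone, snr, random.Random(0)) for snr in (20, 10, 0)}
```

Each entry in `versions` would correspond to one presentation of the same sentence at a different level of difficulty.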

The Spolsky test has been validated against various listening com- 
prehension, reading, and writing tests (Spolsky et al 1968; Spolsky 
1971), with concurrent validity correlations ranging between .36 and 
.66. The reduced redundancy technique has not to the writer's knowl- 
edge been validated against the FSI interview or other tests requiring 
actual speech production on the student's part, and the extent of cor- 
relation of reduced redundancy tests with direct measures of speak- 
ing proficiency remains to be determined. 




The "cloze" test is another indirect procedure which recently has
received considerable attention. This technique, originated by W. L.
Taylor (1953) in the context of native-language testing, involves the
systematic deletion of letters or words from a continuous printed
text, which the student is asked to resupply on the basis of
contextual clues available in the remaining portion of the text.
Numerous experimental studies of the cloze procedure have been
carried out over the past several years (Carroll, Carton, and Wilds 1959;
Oller and Conrad 1971), including investigations of the deletion of
only certain categories of words such as prepositions (Oller and Inal
1971); computer-based scoring using a "clozentropy" formula based
on information theory (Darnell 1968); and human scoring in which
any contextually-acceptable response is considered correct, not
necessarily the originally deleted word (Oller 1972).
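
The deletion and scoring variants just described can be made concrete with a short Python sketch. It is an illustration only and is not taken from any of the studies cited: the every-nth-word deletion rule, the function names `make_cloze` and `score_cloze`, the sample passage, and the list of acceptable alternatives are all assumptions. Punctuation handling is omitted for brevity.

```python
def make_cloze(text, n=5, blank="____"):
    """Delete every n-th word; return the mutilated text and the answer key."""
    words = text.split()
    key = []
    for i in range(n - 1, len(words), n):
        key.append(words[i])
        words[i] = blank
    return " ".join(words), key


def score_cloze(responses, key, acceptable=None):
    """Count correct responses.

    With acceptable=None only the originally deleted word counts (exact
    replacement). Otherwise `acceptable` maps an item index to a set of
    further words judged contextually acceptable for that blank.
    """
    acceptable = acceptable or {}
    score = 0
    for i, (given, original) in enumerate(zip(responses, key)):
        allowed = {original.lower()} | {w.lower() for w in acceptable.get(i, ())}
        if given.lower() in allowed:
            score += 1
    return score


passage = "The student reads the passage and fills in each missing word from context"
test_text, key = make_cloze(passage, n=4)

answers = ["this", "in", "by"]
exact = score_cloze(answers, key)                                        # exact-word scoring
lenient = score_cloze(answers, key, acceptable={0: {"this", "that"}})    # acceptable-word scoring
```

Under exact-word scoring only the second answer is credited; under acceptable-word scoring the first is credited as well, which is why the two methods can yield different totals for the same paper.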

Very satisfactory concurrent validity coefficients have been found
for the cloze tests, using as criteria various other presumably more
direct measures of overall language proficiency. Darnell (1968)
reported a correlation of .84 between a 200-item cloze test and the total
score on the Test of English as a Foreign Language (TOEFL). Oller
(1972) obtained a correlation of .83 between a cloze test scored on a
contextually-acceptable basis and the UCLA placement examination,
consisting of vocabulary, grammar, reading, and dictation sections.
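
A concurrent validity coefficient of this kind is ordinarily a product-moment (Pearson) correlation between the scores of the same examinees on the two tests. The sketch below shows the computation in Python; the function name `pearson_r` and both score lists are invented for illustration and do not reproduce any data from the studies cited.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for five examinees on an indirect and a criterion test.
cloze_scores = [12, 18, 25, 31, 40]
criterion_scores = [410, 455, 500, 540, 610]
r = pearson_r(cloze_scores, criterion_scores)
```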

As is the case with reduced redundancy testing, there appears to be
no experimental information currently available on the extent of
correlation between cloze-type measures and direct tests of oral
proficiency per se; such studies would be very useful in determining
the extent to which tests based on the cloze procedure might be used
as surrogates for direct oral proficiency testing.

In terms of practicality, both reduced redundancy tests and cloze 
procedures offer considerable advantages. Test administration can 
be carried out on a mechanical basis, using a test tape and student 
response booklet for the reduced redundancy test and a test booklet 
alone for the cloze procedure.

Scoring complexity and time required to score cloze tests depend
on the particular grading system used. A major drawback of the
Darnell clozentropy system is the need for computer-based computation
in the course of the scoring process; this limits use of the clozentropy
technique to schools or institutions having the necessary technical
facilities. Human scoring of regular cloze tests is rapid and highly
objective, especially when exact replacement of the original word is
the scoring criterion. Multiple-choice versions of the cloze test are
also possible, further speeding and objectifying the scoring process.

Despite the potentially high level of practicality of reduced
redundancy and cloze techniques, the ultimate usefulness of these and





other indirect techniques as measures of oral proficiency will rest on
the magnitude of the correlations that can be developed between
them and the more direct measures, correlations based on the
simultaneous administration of both kinds of tests to examinee groups similar
in personal characteristics and language learning history to those
students who would eventually be taking only the indirect test. It
should also be noted that tests which do not actually require the
student to speak would probably not have as much motivational
impact towards speaking practice and improvement as tests requiring
oral production, especially the direct conversational interview. It
may thus be desirable for pedagogical reasons to favor the direct
testing of proficiency wherever possible.

This discussion may be concluded with a few summary remarks.
If oral proficiency is defined as the student's ability to communicate
accurately and effectively in real-life language-use contexts,
especially in the face-to-face conversations typical of the great majority of
real-world speech activities, considerations of face validity appear to
require human administration of a conversation-based test, which
must also be evaluated by human raters. For this reason, direct
interview techniques deserve continuing close attention and experimental
study aimed at improving both the test administration and scoring
procedures. The latter must be continuously reviewed to insure that
they call for examiner judgments of the student's communicative
ability and effectiveness, rather than his command of specific
linguistic features.* To permit practical and economical administration in
the school setting, interview-based tests must also be designed to
reach acceptable reliability levels within relatively short testing times.

Proponents of direct proficiency testing can be encouraged by the
limited but tantalizing data which suggest that these techniques are
competitive with current standardized speaking tests in terms of both
scoring reliability and overall cost. The higher level of face validity
of the direct proficiency techniques, together with the considerable
motivational value inherent in work-sample tests of communicative
ability, would commend these techniques to language teachers and
testers alike for continuing investigation and increased practical use.

*On this point, see Clark 1972, pp. 121-12[?].

REFERENCES 

Beardsmore, H. Baetens and A. Renkin (1971). "A Test of Spoken English."
International Review of Applied Linguistics 9:1, 1-11.

Buiten, Roger and Harlan Lane (1965). "A Self-Instructional Device for
Conditioning Accurate Prosody." International Review of Applied
Linguistics 3:3, 205-219.

Carroll, John B. (1967). The Foreign Language Attainments of Language
Majors in the Senior Year: A Survey Conducted in U.S. Colleges and
Universities. Cambridge, Mass.: Laboratory for Research in Instruction,
Harvard University Graduate School of Education.

Carroll, John B., Aaron S. Carton, and Claudia P. Wilds (1959). An
Investigation of "Cloze" Items in the Measurement of Achievement in
Foreign Languages. Cambridge, Mass.: Laboratory for Research in
Instruction, Harvard University Graduate School of Education.

Cartier, Francis A. (1968). "Criterion-Referenced Testing of Language
Skills." TESOL Quarterly 2:1, 27-32.

Clark, John L. D. (1972). Foreign Language Testing: Theory and Practice.
Philadelphia: Center for Curriculum Development.

Darnell, D. K. (1968). The Development of an English Language Proficiency
Test of Foreign Students Using a Clozentropy Procedure. Boulder, Co.:
Department of Speech and Drama, University of Colorado.

Davison, J. M. and G. M. Geake (1970). "An Assessment of Oral Testing
Methods in Modern Languages." Modern Languages 51:3, 116-123.

Educational Testing Service (1965). Handbook: MLA-Cooperative Foreign
Language Tests. Princeton, N.J.: Educational Testing Service.

Foreign Service Institute (1963). "Absolute Language Proficiency Ratings."
(Circular.) Washington, D.C.: Foreign Service Institute.

Oller, John W., Jr. (1972). "Scoring Methods and Difficulty Levels for
Cloze Tests of Proficiency in English as a Second Language." Modern
Language Journal 56:3, 151-158.

Oller, John W., Jr. and Christine Conrad (1971). "The Cloze Technique and
ESL Proficiency." Language Learning 21:2, 183-196.

Oller, John W., Jr. and Nevin Inal (1971). "A Cloze Test of English
Prepositions." TESOL Quarterly 5:4, 315-325.

Perren, George (1967). "Testing Ability in English as a Second Language:
3. Spoken Language." English Language Teaching 22:1, 22-29.

Pimsleur, Paul (1967). Pimsleur French Proficiency Tests - Manual. New
York: Harcourt, Brace & World.

Pulliam, Robert (1969). Application of the SCOPE Speech Interpreter in
Experimental Educational Programs. Fairfax, Va.: Pulliam and Associates.

Rice, Frank A. (1959). "The Foreign Service Institute Tests Language
Proficiency." Linguistic Reporter 1:2, 4.

Rude, Ben D. (1967). "A Technique for Language Laboratory Testing."
Language Learning 17:3 & 4, 151-153.

Spolsky, Bernard (1971). "Reduced Redundancy as a Language Testing Tool."
In G. E. Perren and J. L. M. Trim (eds.), Applications of Linguistics:
Selected Papers of the Second International Congress of Applied
Linguistics, Cambridge, 1969. Cambridge: Cambridge University Press,
383-390.

Spolsky, Bernard, Bengt Sigurd, Masahito Sato, Edward Walker, and
Catherine Arterburn (1968). "Preliminary Studies in the Development of
Techniques for Testing Overall Second Language Proficiency." Language
Learning, Special Issue No. 3, 79-101.

Starr, Wilmarth H. (1962). "MLA Foreign Language Proficiency Tests for
Teachers and Advanced Students." PMLA 77:4, Part II, 1-12.

Taylor, Wilson L. (1953). "Cloze Procedure: A New Tool for Measuring
Readability." Journalism Quarterly 30:4, 414-438.



DISCUSSION 



Spolsky: There's one thing that might be worth thinking about that I think
you excluded, and that is that the oral interview and so on comes out to be
simply a conversation. There is also the possibility of considering the
communication task as a test, the kind of situation where the examinee sits in a
room, the telephone rings, he picks it up, somebody starts speaking to him in
another language, and he has a choice of either using that language or trying to
avoid using it. The other person is trying to get directions, and either he does
get to the place he's supposed to or he doesn't. You can say at the end of the
test that either he was capable of communicating or not. This kind of
communication task test is one in which the judgment of its effectiveness is
whether or not the speaker communicates with the listener. It would be
theoretically possible to set this up in such a way that you have a mechanical
rather than a human judgment. The problem of deciding what the qualities of
the listening person need to be is one thing to be taken into account. But a
person could be given mechanically a certain piece of information to
communicate to a second person, the second person performs the task, and if he
performs it successfully, then mechanically this could be scored in such a
way. From the results of previous experiments, there appears to be a way of
testing communication ability, which is the speaking side, that has absolutely
no correlation with other indirect measures of language ability. I wonder if
you'd perhaps like to comment on that?

Clark: I'm fairly familiar with that and similar techniques. I'd say certainly
any and all testing techniques we can devise or think of merit consideration.
The question would be whether we'd be willing to call this kind of thing a 
face valid direct test of proficiency. My own inclination would be to stick 
with the real conversational situation as the criterion test, and then hope that 
we could develop a correlation of .99 or thereabouts between the face-to-face 
interview and some other kind of measure. 

Lado: I don't think there is any merit in face validity; face validity means the
appearance of validity. I think that there are questions concerning the
interview from the point of view of sample, and I think that the interview is a poor
sample. For example, most interviews don't give the subject a chance to ask
questions. He gets asked questions, but he doesn't ask them. And it seems to
me that asking questions is a very important element of communication.
Second, the interview will usually go on to some limited number of topics.
Who is able to produce 100 different original topics of conversation with 100
different subjects? Therefore, it may not even be a very good sample of
situations. So I think that the question of the validity of the sample itself isn't
proven. Then, it's been mentioned by everybody that the interview is highly
subjective. There is what can be termed a "halo effect." I'd hate to be
interviewed after somebody who's terrific, because no matter what I am, I'm going
to be cut down. I'd like to come after somebody who got a rating of 0+; then
my chances of showing up are better. There's the personality of the
interviewer and interviewee. There's also the fact of accents. Sociolinguistics has
shown that we react differently to different accents. For example, the Spanish
accent in an English-speaking test will tend to rate lower than a French or a
German accent, or some other accent like that. There is also the problem of







keeping the level of scoring more or less even. It's true that you can record
these interviews and go back to them, but it's more likely that there will be
some drifting away or raising of standards as you go. I think the scoring of
nine or ten or eleven points is coarse. It's a mixed bag, and it's all right
perhaps for certain purposes, but if we have to use this interview six years in a
row in a language sequence, we would find that a lot of students would
remain at 1 for five years. We might conclude that they haven't learned
anything, but I think there might be finer ways of finding out if they have learned
something, if in fact they have. I think that the interview is a poor test of
listening. And I certainly go along with the CIA on this: they have a separate
listening test. How many questions do you ask an interviewee? I'm sure the
reliability of the listening part would be very poor. Finally, I think the
interview mixes skills with proficiency, and I think Clark is on the right track in
his book when he says you can't do both of them in one interview. You're
either after proficiency, and don't get down to the specifics, or you get down
to the competence, and there are better ways to do this than the interview. I
am in disagreement with Clark's pejorative intimation concerning indirect
techniques, and his favorable "halo" toward direct techniques.

Clark: Let's discuss that later.

Anon.: How long does it take to train a tester?

Clark: Our Peace Corps experience might be helpful in answering that
question. We think that we're able to train a tester in 2 days of face-to-face work
and discussion, preceded by a couple of days of homework on his part:
reading an instructional manual, listening to sample tapes and so forth. I'd
suggest that this kind of time requirement is pretty much in line with the
amount of time it takes to train someone to score the MLA COOP tests, for
example. So I think we can be cost-effective in terms of the training time of
the interviewer.

Anon.: As I understood the FSI technique, 95 percent of the raters agreed in
the rating that was given. Is that correct?

Clark: First let me say that it was a fairly small-scaled study. Some 80
interviews were examined. We need a much more comprehensive study of this.
But of those 80 interviews, two raters were simultaneously present during the
interview. Then at the end of the interview they independently rated on the
basis of 1 2 3 4 5, not 1+ vs. 2, for example. But within the categories 1 2 3 4 5,
95 percent of their ratings were identical.
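
The agreement figure described here is a simple proportion: the share of interviews on which the two raters independently assigned the same whole-number level. A minimal Python sketch of that calculation follows; the function name `percent_agreement` and the ten pairs of ratings are invented for illustration and are not the data from the study mentioned.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of examinees given the identical level by both raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    same = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100.0 * same / len(rater_a)


# Hypothetical whole-level ratings (1-5) from two raters for ten interviews.
rater_a = [1, 2, 2, 3, 4, 3, 2, 1, 5, 3]
rater_b = [1, 2, 3, 3, 4, 3, 2, 1, 5, 3]
agreement = percent_agreement(rater_a, rater_b)  # 9 of the 10 pairs match
```

Simple percent agreement takes no account of matches that would occur by chance, so it tends to look more favorable on a coarse five-level scale than on a finer one.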

Anon.: Isn't it odd that there were correlations of .31 in the other types of tests 
that were given? 

Clark: Yes, I think that's very interesting. I hoped that that would come 
across. 

Scott: I question whether a one-shot test is really adequate.
Clark: If you are talking about determining a student's proficiency at a
specific point in time, rather than determining any sort of growth that he
makes, I would say that a one-shot test is sufficient, provided that the test is








a valid and reliable representation of his ability. If we find that within the
space of 2 or 3 days he's administered the test five times and he gets widely
varying scores, then our test is in trouble. But if we have a test which can
reliably evaluate on a "single shot" basis, all the better.

Spolsky: May I just make one brief comment on that? As I remember we
talked about this problem a couple of years ago: that's the problem that
proficiency tests are also used as predictors of how people will perform when
put into a new language environment. The question was raised then that,
while you may have two people at exactly the same point on the proficiency
scale, you do want to know which of them, when thrown completely into the
language speaking situation, will learn faster, and I think that's a fairly strong
argument for a two-shot test or a kind of test that will also find out at what
point on the language learning continuum the learner happens to be.
Oller: I'd like to make three quick comments. I want to agree very strongly
with what John Clark said about the oral interview and the reasons why he
thinks that's a realistic kind of thing to demand of people. Unfortunately,
natural situations sometimes generate tension, and I don't think that's an
argument against the interview. The second comment is that it seems to
me that there's another kind of validity that correlational validity is in
evidence for. And I would suggest a term something like psycholinguistic
validity. It's something that has to do with what is, in fact, in a person's brain that
enables him to operate with language. And if we're tapping into that
fundamental mechanism, then I think we have a deeper kind of validity than face
validity or correlational validity or some of the others. Correlational validity
is, I think, evidence of that kind of deeper validity. The third comment is
that, in reference to the low correlation on the mimicry test, I think that that's
very possibly due to the fact that short phrases were used. If longer phrases
were used that challenged the short-term memory of the person being
interviewed and forced him to operate on the basis of his deep, underlying system
or grammar, I think the test would yield much higher validity.
Clark: Perhaps the .31 correlation for mimicry could be increased, as you
suggest, by having longer sentences or something similar. But I think the
general point is still valid that, if you look at the test manuals or handbooks
for these tests (the Pimsleur Test manual, for example), you'll find no
reliability figures for the scoring of the speaking test, and you'll find a caution to
the effect that the score ranges must be interpreted very carefully, or words to
this effect. If you look at the MLA COOP handbook, you will find reasonably
low correlation figures and also cautions against misinterpretation and so
forth. So I think that, as a general principle, the "high correlations" of tape
recorded speaking tests are more fiction than fact.

Davies: Can I make two or three quick comments? First of all, following up
some of the points made about validity, Mr. Clark distinguishes face validity
and concurrent validity and relates these to his indirect and direct methods.
I'd like to see content validity mentioned as well. I think in a way this is






what is behind some of Professor Lado's remarks. If content validity is used,
would you then be engaged in direct or indirect testing? And would the
psycholinguistic thing we just mentioned be considered construct validity?
Finally, I'd like to comment on the question about the one-shot proficiency
testing. It seems to me to be a function of the reliability of the test.
Clark: To take the last comment first, I think we are together on the
question of the one-shot test. I said if the test is a reliable indication of ability in
the sense that it can be repeated with the same score, why give all the
different tests rather than the one? I think the question of construct validity or
psycholinguistic validity, however we want to talk about it, will be coming
up again. Regarding the first question, content validity vs. face validity, I may
have given a slightly wrong impression about what I think face validity
involves. Face validity for me would be careful examination by people who
know their stuff; language people and language testers look at the test, at
what it's got in it, at the way it's administered, at the way it's scored; in other
words they look at the whole business of it, and this is face validity in my
sense, as opposed to a statistical correlation validity. True, we don't want to
rule out very close scrutiny of the test, and I think we'll keep that under the
term face validity.






The Oral Interview Test 

Claudia P. Wilds 



Since 1956 the Foreign Service Institute of the Department of State has
been rating Government employees on a simple numerical scale
which succinctly describes speaking proficiency in a foreign language.
This scale has become so widely known and well understood that a
reference to a point on the scale is immediately and accurately
intelligible to most people concerned with personnel assignments in the
numerous Government foreign affairs agencies who now use the FSI
rating system.

The usefulness of the system is based on careful and detailed
definition, in both linguistic and functional terms, of each point on the
scale.

This paper is concerned, first, with a description of the testing
procedures and evaluation techniques whereby the rating system is
currently applied at the Foreign Service Institute and the Central
Intelligence Agency and, second, with the problems that seem to be
inherent in the system.

BACKGROUND 

Prior to 1952 there was no inventory of the language skills of Foreign
Service Officers and, indeed, no device for assessing such skills. In
that year, however, a new awareness of the need for such information
led to preliminary descriptions of levels of proficiency and
experimental rating procedures. By 1956 the present rating system and
testing methods had been developed to a practicable degree.

Both the scope and the restrictions of the testing situation provided
problems and requirements previously unknown in language testing.
The range of these unique features is indicated below:

• The need to assess both speaking and reading proficiency within
a half-hour to an hour. The requirement was imposed principally by
the limited time available in the examinee's crowded schedule.

• The need to measure the complete range of language competence,
from the skill acquired in 100 hours of training or a month of
experience abroad to the native facility of someone who received his entire
education through the foreign language.

• A population consisting of all the kinds of Americans serving the
United States overseas: diplomats at all stages of their careers,
secretaries, agricultural specialists, Peace Corps volunteers, soldiers, tax
experts, and many others. They might have learned their language
skills at home, on the job, or through formal training, in any
combination and to any degree. Generally no biographical information was
available beforehand.

• The necessity for a rating system applicable to any language; easy
to interpret by examiners, examinees, and supervisors; and
immediately useful in decisions about assignments, promotions, and job
requirements.

• The need for unquestioned face validity and reputation of high
reliability among those who take the test and those who use the
results.

With these restrictions there was, from the beginning, very little
choice in the kind of test that could be given. A structured interview
custom-built to fit each examinee's experience and capabilities in the
language promised to use the time allowed for the test with maximum
efficiency. A rating scale, with units gross enough to ensure reasonable
reliability, was developed on the basis of both linguistic and functional
analyses. The definitions, which appear at the end of this article, are a
modified version worked out by representatives of FSI, the CIA, and
the Defense Language Institute in 1968 to fit the characteristics of as
broad a population of Government employees as possible.

PROCEDURE 

The testing team at FSI consists of a native speaker of the language
being tested and a certified language examiner who may be either an
experienced native-speaking language instructor or a linguist
thoroughly familiar with the language. At the CIA two native speakers
who are language instructors conduct the test.

The usual speaking test at FSI is conducted by the junior member of
the testing team, who is always a native speaker. The senior member,
who normally has native or near-native English, observes and takes
notes. To the greatest extent possible the interview appears as a
relaxed, normal conversation in which the senior tester is a mostly
silent but interested participant. At the CIA the two interviewers take
turns participating and observing. The procedures to be described here
are primarily those which are used at FSI, which can normally take
advantage of having one examiner who is a native speaker of English.

The test begins with simple social formulae in the language being
tested: introductions, comments on the weather, questions like "Have
you just come back from overseas?" or "Is this the first time you've
taken a test here?"

The examinee's success in responding to these opening remarks will
determine the course of the rest of the test. If he fails to understand






some of them, even with repetition and rephrasing, or does not answer
easily, at least a preliminary ceiling is put on the level of questions to
be asked. He will be asked as simply as possible to talk about himself,
his family, and his work; he may be asked to give street directions, to
play a role (e.g. renting a house), or to act as interpreter for the senior
tester on a tourist level. Rarely, he may handle these kinds of
problems well enough to be led on to discussions of current events or of
detailed aspects of his job. Usually he is clearly pegged at some point
below the S-2 rating.

The examinee who copes adequately with the preliminaries
generally is led into natural conversation on autobiographical and
professional topics. The experienced interviewer will simultaneously
attempt to elicit the grammatical features that need to be checked. As
the questions increase in complexity and detail, the examinee's
limitations in vocabulary, structure, and comprehension normally become
apparent quite rapidly. (A competent team usually can narrow the
examinee's grade to one of two ratings within the first five or ten
minutes; they spend the rest of the interview collecting data to verify their
preliminary conclusions and to make a final decision.)

If the examinee successfully avoids certain grammatical features, if
the opportunity for him to use them does not arise, or if his
comprehension or fluency is difficult to assess, the examiners may use an
informal interpreting situation appropriate to the examinee's apparent
level of proficiency. If the situation is brief and plausible and the
interchange yields a sufficient amount of linguistic information, this
technique is a valuable supplement.

A third element of the speaking test, again an optional one, involves
instructions or messages which are written in English and given to
the examinee to be conveyed to the native speaker (e.g. "Tell your
landlord that the ceiling in the living room is cracked and leaking and
the sofa and rug are ruined."). This kind of task is particularly useful
for examinees who are highly proficient on more formal topics or who
indicate a linguistic self-confidence that needs to be challenged.

In all aspects of the interview an attempt is made to probe the
examinee's functional competence in the language and to make him
aware of both his capacities and limitations.

The speaking test ends when both examiners are satisfied that they
have pinpointed the appropriate S-rating, usually within a half hour or
less.

EVALUATION 

When the interview is over, the examiners at FSI independently fill
out the "Checklist of Performance Factors" with which they are
provided. This checklist, reproduced at the end of this article, records a








profile of the examinee's relative strengths and weaknesses, but was
designed principally to force each examiner to consider the five
elements involved.

A weighted scoring system for the checklist has been derived from a
multiple correlation with the overall S-rating assigned (R=.95).
The weights are basically these: Accent 0, Grammar 3, Vocabulary 2,
Fluency 1, Comprehension 2. Partly because the original data came
mainly from tests in Indo-European languages and partly because of a
widespread initial suspicion of statistics among the staff, use of the
scoring system has never been made compulsory or even urged,
though the examiners are required to complete the checklist. The
result has been that most examiners compute the checklist score only in
cases of doubt or disagreement. Nevertheless, the occasional
verifications of the checklist profiles seem to keep examiners in all
languages in line with each other (in the sense that an S-2 in Japanese
will have much the same profile as an S-2 in Swahili); and those who
once distrusted the system now have faith in it.
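
The weighting just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the relative weights are those quoted above, but numbering the six segments of each scale from 1 to 6 is an assumption, the sample profile is hypothetical, and the conversion from a weighted score to an S-rating is not given in this article and is not attempted here.

```python
# Relative weights quoted in the text:
# Accent 0, Grammar 3, Vocabulary 2, Fluency 1, Comprehension 2.
WEIGHTS = {
    "accent": 0,
    "grammar": 3,
    "vocabulary": 2,
    "fluency": 1,
    "comprehension": 2,
}


def checklist_score(marks):
    """Return the weighted sum of the five checklist marks.

    `marks` maps each factor to the segment checked, numbered 1..6.
    The 1..6 numbering is an assumption; the text says only that
    each scale has six segments.
    """
    for factor in WEIGHTS:
        segment = marks[factor]
        if not 1 <= segment <= 6:
            raise ValueError(f"{factor}: segment must be 1..6, got {segment}")
    return sum(WEIGHTS[factor] * marks[factor] for factor in WEIGHTS)


# Hypothetical profile, not taken from the text.
profile = {
    "accent": 2,
    "grammar": 4,
    "vocabulary": 4,
    "fluency": 3,
    "comprehension": 5,
}
print(checklist_score(profile))
```

Because Accent carries a weight of 0, changing the accent mark leaves the score unchanged, which matches the remark in the discussion that examiners essentially ignore accent once the speaker is intelligible.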

To the trained examiner each blank on each scale indicates a quite
specific pattern of behavior. The first two scales, Accent and
Grammar, obviously indicate features that can be described most
concretely for each language. The last three refer to features that are
easy to equate from language to language but difficult to describe
except in functional terms and probably dangerous to measure from so
small a sample of speech on a scale more refined than these six-point
ones.

The checklist does not apply to S-0s or S-5s and thus reflects the
nine ratings from S-0+ to S-4+. Since each of the checklist factors is
represented on a scale with only six segments, a check placed on a
particular scale indicates a degree of competence not necessarily tied
to a specific S-rating. The mark for Grammar for an S-3, for example,
may fall anywhere from the third to the fifth segment, while an S-3's
comprehension is typically in the fifth or sixth segment. In any case,
the examiner is prevented from putting down an unconsidered column
of checks to denote a single S-rating.

The rating each examiner gives is normally not based on the
checklist, however, but on a careful interpretation of the amplified
definitions of the S-ratings. It might be said here that successful
interpretation depends not only on the perceptiveness of the examiner
but at least as much on the thoroughness of his training and the degree
to which he accepts the traditional meaning of every part of each
definition.

The actual determination of the S-rating is handled differently from
team to team at FSI. In some cases the two examiners vote on paper,
in others one suggests a grade and the other agrees or disagrees and
gives his reasons for dissent, in some a preliminary vote is taken, and
disagreement leads to further oral testing until accord is reached. If a
half-point discrepancy cannot be resolved by discussion or averaging
of the computed scores from the checklist, the general rule followed
at FSI is that the lower rating is given. (The rationale for this rule is
that the rating is a promise of performance made by FSI to assignment
officers and future supervisors. The consequences of overrating are
more serious than the consequences of underrating, however
disappointing the marginal decision may be to the examinee himself.)
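
The tie-breaking rule above can be written out as a short Python sketch. It assumes that a plus counts as a half point (as stated later in this article) and that ratings are written in the form "S-2+". The function names are hypothetical, and the handling of gaps larger than a half point is an assumption, since the text prescribes only discussion or further testing in that case.

```python
def to_points(rating):
    """Convert a rating such as 'S-2+' to a number, counting a plus as 0.5."""
    body = rating.split("-", 1)[1]
    plus = body.endswith("+")
    level = int(body.rstrip("+"))
    return level + (0.5 if plus else 0.0)


def fsi_final_rating(first, second):
    """Apply the general FSI rule to two examiners' ratings.

    Equal ratings are returned unchanged. For an unresolved half-point
    discrepancy the lower rating is given. Larger gaps are not covered
    by the rule, so this sketch raises an error (an assumption).
    """
    a, b = to_points(first), to_points(second)
    if a == b:
        return first
    if abs(a - b) == 0.5:
        return first if a < b else second
    raise ValueError("gap larger than a half point: further testing needed")


print(fsi_final_rating("S-2+", "S-3"))
```

Choosing the lower of the two ratings reflects the rationale given above: overrating is treated as the more costly error.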

At the CIA each examiner, without discussion, independently makes
a mark on a segmented five-inch line whose polar points are 0 and 5.
The distance from 0 to the mark is later measured with a ruler and
the two lengths are averaged for the final rating. CIA testers tend less
to analyze the examinee's performance in detail; functional
effectiveness is the overriding criterion.
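
A minimal Python sketch of the CIA procedure follows. It assumes distances are measured in inches along the five-inch line, so that one inch corresponds to one rating point. The function names are hypothetical, and the text does not say how the averaged value is rounded to a reported rating, so no rounding is applied.

```python
LINE_LENGTH_INCHES = 5.0  # length of the printed line, from the text
SCALE_MAX = 5.0           # rating at the far end of the line


def mark_to_rating(distance_inches):
    """Convert the measured distance from 0 to the mark into a rating."""
    if not 0.0 <= distance_inches <= LINE_LENGTH_INCHES:
        raise ValueError("mark must lie on the line")
    return distance_inches * SCALE_MAX / LINE_LENGTH_INCHES


def cia_final_rating(distance_a, distance_b):
    """Average the two examiners' independently made marks."""
    return (mark_to_rating(distance_a) + mark_to_rating(distance_b)) / 2


# Hypothetical measurements: one mark at 2.5 inches, the other at 3.0.
print(cia_final_rating(2.5, 3.0))
```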

PROBLEMS

To those who have little or no familiarity with the rating system
just described, there may be a dozen reasons that come to mind why
it should not work well enough to be a practical and equitable
procedure. Most of the troublesome elements have by now been removed
or made tolerable by the necessity for facing them repeatedly. The
articulate anger of a Foreign Service Officer who feels his career
threatened by a low rating is enough to make those who give such a
rating aware that they must be able to defend it, and the occasional but
vigorous complaints, especially in the early years, have done much to
shape and refine the procedures.

One issue, for example, which has been resolved at the cost of many
challenges is the question of acceptance by the examiners of social
dialects which are not accepted by most educated native speakers of
the language. Although many employees of the foreign aid program
and perhaps a majority of Peace Corps volunteers work with illiterate
and semi-literate people, it was decided that making non-standard
speech and standard speech equally acceptable would make a
shambles of the system, in large part because foreign speakers' errors
are often identical with the patterns of uneducated native speakers. By
insisting on the criteria developed for the speech of Foreign Service
Officers, who obviously must speak the standard dialect, we avoided
having to evolve several sets of rating definitions for other
Government agencies.

The problems that are inherent in the system do not include
reliability among raters of the same performance. Independent
judgments on taped tests rarely vary more than a half-point (that is, a
plus) from the assigned rating. A more serious issue is the stability of
performance with different sets of interviewers. Because this kind of
testing is so expensive, immediate retesting is not permitted,
especially if it is only for research purposes. Consequently, there are
two legitimate and interesting questions that FSI cannot answer: (1) Does
the proficiency of the speaker of a foreign language fluctuate
measurably from day to day? (2) Does his performance vary with the
competence and efficiency of the examiners?

Individualizing the content of each interview has always seemed
the best way to make optimum use of the time available. But this
freedom that the interviewers have allows for the development of several
kinds of inefficiency. The most common is the failure to push the more
proficient examinee to the limits of his linguistic competence, so that
data are lacking to make a reasonable decision between two grades.
Often the intellectual ability to discuss a difficult topic may be
confused with linguistic ability, although the structures and vocabulary
used may be relatively simple ones. Another danger is the possibility,
especially when both interviewers are native speakers of the language
being tested, that both will participate so actively in the conversation
that, for one thing, the examinee gets little chance to talk, and, for
another, neither examiner keeps track of the kinds of errors being
made or the types of structures that have not been elicited. The
interview is designed to be as painless as possible, but it is not a social
occasion, and the rating assigned can only be defended if it is based
on a detailed analysis of the examinee's performance as well as on a
general impression. For this same reason one examiner testing alone is
likely to lose both his skills as an interviewer and his perceptiveness
as an observer to a degree that cannot be justified on the grounds of
economy.

There is thus a continuing possibility that the examinee may not be
given the opportunity to provide a fully adequate sample of his
speech and that the sample he does provide is not inspected with
adequate attention. The obvious way to minimize the chances of this
happening is through a rigorous training period for new examiners;
intermittent programs of restandardizing; and, where possible, shuffling
members of a testing team with great frequency.

The training of testers at FSI has improved greatly in recent years,
largely because of the task that the staff had for several years of
testing vast numbers of Peace Corps volunteers and then teaching others
how to do so. In languages which are tested often there are good
libraries of tapes of tests at all levels which the new interviewer can
use to learn first the rating system and then the testing techniques
before he puts them into practice in the testing room. There is also
a substantial amount of written material aimed at clarifying standards
and suggesting appropriate techniques, as well as a staff that now has
years of experience in guiding others in testing competence.

Difficulties arise chiefly in languages that are tested so rarely that
it is hard for the interviewers to internalize standards or to develop
facility in conducting interviews at levels appropriate to different
degrees of proficiency. In a number of languages the majority of tests
are given in a week's time several times a year to graduating students
whom the examiners know well and whose range of proficiency is
relatively narrow. The rest of the tests in that language may number
no more than a half dozen scattered throughout the year, at
unpredictable levels of competence. It is too often the case that the
native speaker interviewing in such a language knows no other language
that is tested with more frequency, and it has been true more than once
that the senior tester involved is equally restricted. At the same time,
no one else on the staff may be familiar with the language involved.
When this happens, the testers of that language cannot be adequately
trained, tests cannot be effectively monitored, and both standards and
procedures may diverge from the norm. In such cases one can only
have faith in the clarity of the guidelines and the intelligence and
conscientiousness of the examiners. (One form of control could be a
periodic analysis of recorded tests by a highly qualified tester of
another language who would go over the tapes line by line with the
original interviewers.)

Even in languages in which tests are conducted as frequently as
French and Spanish, where there is no doubt that standards are
internalized and elicitation techniques are mastered, it is possible for
criteria to be tightened or relaxed unwittingly over a period of several
years so that ratings in the two languages are not equivalent or that
current ratings are discrepant from those of earlier years.

The fact of the matter is that this system works. Those who are
subject to it and who use the results find that the ratings are valid,
dependable, and therefore extremely useful in making decisions about
job assignments. It is, however, very much an in-house system which
depends heavily on having all interviewers under one roof, able to
consult with each other and share training advances in techniques or
solutions to problems of testing as they are developed and subject to
periodic monitoring. It is most apt to break down as a system when
examiners are isolated by spending long periods away from home
base (say a two-year overseas assignment), by testing in a language no
one else knows, or by testing so infrequently or so independently that
they evolve their own system.

It is therefore not ideal for the normal academic situation where all
testing comes at once (making it difficult to acquire facility in
interviewing ahead of time) and where using two teachers to test each
student would be prohibitively expensive. It can be and has been applied
in high schools and colleges where the ratings are not used as
end-of-course grades but as information about the effectiveness of the
teaching program or as a way of discovering each student's ability to
use the language he has been studying.



FSI Language Proficiency Ratings

The rating scales described below have been developed by the Foreign
Service Institute to provide a meaningful method of characterizing the
language skills of foreign service personnel of the Department of
State and of other Government agencies. Unlike academic grades,
which measure achievement in mastering the content of a prescribed
course, the S-rating for speaking proficiency and the R-rating for
reading proficiency are based on the absolute criterion of the command
of an educated native speaker of the language.

The definition of each proficiency level has been worded so as to
be applicable to every language; obviously the amount of time and
training required to reach a certain level will vary widely from
language to language, as will the specific linguistic features.
Nevertheless, a person with S-3s in both French and Chinese, for
example, should have approximately equal linguistic competence in the
two languages.

The scales are intended to apply principally to Government
personnel engaged in international affairs, especially of a diplomatic,
political, economic, and cultural nature. For this reason heavy stress is
laid at the upper levels on accuracy of structure and precision of
vocabulary sufficient to be both acceptable and effective in dealings
with the educated citizen of the foreign country.

As currently used, all the ratings except the S-5 and R-5 may be
modified by a plus (+), indicating that proficiency substantially
exceeds the minimum requirements for the level involved but falls short
of those for the next higher level.

DEFINITIONS OF ABSOLUTE RATINGS

Elementary Proficiency

S-1 Able to satisfy routine travel needs and minimum courtesy
requirements. Can ask and answer questions on topics very familiar to
him; within the scope of his very limited language experience can
understand simple questions and statements, allowing for slowed speech,
repetition or paraphrase; speaking vocabulary inadequate to express
anything but the most elementary needs; errors in pronunciation and
grammar are frequent, but can be understood by a native speaker used
to dealing with foreigners attempting to speak his language; while
topics which are "very familiar" and elementary needs vary considerably
from individual to individual, any person at the S-1 level should be
able to order a simple meal, ask for shelter or lodging, ask and give
simple directions, make purchases, and tell time.

R-1 Able to read some personal and place names, street signs, office
and shop designations, numbers, and isolated words and phrases. Can
recognize all the letters in the printed version of an alphabetic system
and high-frequency elements of a syllabary or a character system.

Limited Working Proficiency 

S-2 Able to satisfy routine social demands and limited work
requirements. Can handle with confidence but not with facility most
social situations including introductions and casual conversations about
current events, as well as work, family, and autobiographical
information; can handle limited work requirements, needing help in
handling any complications or difficulties; can get the gist of most
conversations on non-technical subjects (i.e. topics which require no
specialized knowledge) and has a speaking vocabulary sufficient to
express himself simply with some circumlocutions; accent, though often
quite faulty, is intelligible; can usually handle elementary constructions
quite accurately but does not have thorough or confident control of the
grammar.

R-2 Able to read simple prose, in a form equivalent to typescript or
printing, on subjects within a familiar context. With extensive use of a
dictionary can get the general sense of routine business letters,
international news items, or articles in technical fields within his
competence.

Minimum Professional Proficiency

S-3 Able to speak the language with sufficient structural accuracy
and vocabulary to participate effectively in most formal and informal
conversations on practical, social, and professional topics. Can
discuss particular interests and special fields of competence with
reasonable ease; comprehension is quite complete for a normal rate of
speech; vocabulary is broad enough that he rarely has to grope for a
word; accent may be obviously foreign; control of grammar good;
errors never interfere with understanding and rarely disturb the
native speaker.

R-3 Able to read standard newspaper items addressed to the general
reader, routine correspondence, reports and technical material in his
special field. Can grasp the essentials of articles of the above types
without using a dictionary; for accurate understanding moderately
frequent use of a dictionary is required. Has occasional difficulty with
unusually complex structures and low-frequency idioms.

Full Professional Proficiency 

S-4 Able to use the language fluently and accurately on all levels
normally pertinent to professional needs. Can understand and
participate in any conversation within the range of his experience with
a high degree of fluency and precision of vocabulary; would rarely be
taken for a native speaker, but can respond appropriately even in
unfamiliar situations; errors of pronunciation and grammar quite rare;
can handle informal interpreting from and into the language.

R-4 Able to read all styles and forms of the language pertinent to
professional needs. With occasional use of a dictionary can read
moderately difficult prose readily in any area directed to the general
reader, and all material in his special field including official and
professional documents and correspondence; can read reasonably legible
handwriting without difficulty.

Native or Bilingual Proficiency

S-5 Speaking proficiency equivalent to that of an educated native
speaker. Has complete fluency in the language such that his speech on
all levels is fully accepted by educated native speakers in all of its
features, including breadth of vocabulary and idiom, colloquialisms,
and pertinent cultural references.

R-5 Reading proficiency equivalent to that of an educated native.
Can read extremely difficult and abstract prose, as well as highly
colloquial writings and the classic literary forms of the language. With
varying degrees of difficulty can read all normal kinds of handwritten
documents.

Checklist of Performance Factors 

1. ACCENT          foreign     ___:___:___:___:___:___  native

2. GRAMMAR         inaccurate  ___:___:___:___:___:___  accurate

3. VOCABULARY      inadequate  ___:___:___:___:___:___  adequate

4. FLUENCY         uneven      ___:___:___:___:___:___  even

5. COMPREHENSION   incomplete  ___:___:___:___:___:___  complete



DISCUSSION 

Nickel: I'm particularly interested in evaluations. In connection with
this, was there any particular reason for weighting grammar with 3
points over 2 points on the vocabulary side?

Wilds: It was decided statistically. We had some 800 people fill out the
checklist, then correlated it with the overall S-rating they assigned.



Nickel: Has there been any attempt to arrange these factors in
hierarchical order, with preference given to the vocabulary side or to
the grammatical side?

Wilds: According to the weights I think grammar is considered the most
important of the five.

Nickel: Is there a linguistic basis for this?

Wilds: No.

Petersen: You encourage people to ignore accent?

Wilds: The fact is that they essentially do ignore it once the speaker is
intelligible.

Jones: Could I just say concerning language testing, or any testing for
that matter, there is in addition to face validity the initial reaction on
the part of the person looking at this type of test? Almost without
exception all the people I know who have seen or heard about an oral
interview test for the first time react with shock. It can't be done. It's
too subjective. There's no way to evaluate it. This was my reaction too
when I was first exposed to it. But after having observed or
participated in more than 100 oral interview tests, I find that it's a
very valid system. First of all, in the training of the testers we don't
only use these definitions that have been passed out to you today.
These are only for the consumer, to indicate roughly what the levels
are supposed to be. New testers have to be told in great detail what is
to be expected on the part of the examinee in terms of content as well
as the structure of the language. After the training period, they do
have a pretty good intuitive idea of what a 2-level speaker is supposed
to be able to do. We are in the process now of doing a validity study, a
cross-agency study in three different languages, and we are finding
that the reliability is very good. In other words, the tester does have a
good idea of what the various levels are supposed to be in terms of
performance. As far as fright is concerned, in observing many tests I
have found that it does occur, but primarily only initially. A good
tester can set the stage to be able to minimize this shock. I might add
that many of us have looked around and have found nothing suitable for
our purposes to take the place of the oral interview test. It has to be a
test which, as much as possible, can recreate the situation the person
is going to be exposed to when he has to use the language. I'd like to
ask John Quinones to explain the scale and use of the ruler at the CIA,
and about the independent rating system.

Quinones: When I first had to deal with testing at the Central
Intelligence Agency, I found that the two testers would consult with
each other, and if they differed, they would write the rating down on a
piece of paper, discuss it further, and then decide which rating they
were going to assign the individual. I thought this wasn't a very good
idea, because one tester might tend to be a bit more dominant than the
other, or one might have more experience than the other. I was afraid
that in many instances one rater, in spite of the fact that he might
have the wrong rating, would be the dominant rater. In order to avoid
this, we developed a system in which raters would rate independently
using a scale with defined levels. Instead of discrete items on a given
scale, they were defined as ranges. The testers, without discussing
anything whatsoever, would indicate by writing within a given range,
let's say the range of the 2 or the range of the 3, a line indicating
how high or how low the person was in that range. Then, without any
discussion, these sheets would be taken to a scorer, who, using a ruler
divided into centimeters, would then measure each rating, average
them, and arrive at a final rating. If a discrepancy existed by more
than a level and a half, we would look for a third rater. After some
studies we concluded that this is probably one of the most accurate,
and one of the best, ways of assuring the reliability of the score,
because we know that statistically the average rating is always more
accurate than the rating of the best scorer.
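
The scorer's step in the procedure Quinones describes might be sketched as follows. This is a hedged illustration only: it assumes the ruler measurements have already been converted to rating levels, the function name and return convention are hypothetical, and nothing is said in the discussion about how a third rater's judgment is combined with the first two.

```python
THIRD_RATER_THRESHOLD = 1.5  # "more than a level and a half"


def score_sheets(rating_a, rating_b):
    """Return (final_rating, needs_third_rater) for two independent ratings.

    When the two ratings differ by more than a level and a half, no final
    rating is produced and the case is flagged for a third rater.
    """
    if abs(rating_a - rating_b) > THIRD_RATER_THRESHOLD:
        return None, True
    return (rating_a + rating_b) / 2, False


# Hypothetical ratings, not taken from the discussion.
print(score_sheets(2.25, 2.75))
print(score_sheets(1.0, 3.0))
```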

Oller: I don't see any basis for that kind of detailed analysis without
some fairly solid research to show that it's superior. All you're doing
is multiplying the points on the scale. To get back to the discussion at
hand, however, it seems to me that the system of oral interview can
work. I feel that it would be possible to operationalize the definitions
of what constitutes a zero rating, or a five rating by simply making
some permanent recordings and keeping them in store, using them in the
training of interviewers and in testing the reliability of different
groups of interviewers against a collection of data based on that store
of tapes. If that kind of calibration is done, and if reliability
research indicates that interviewers are capable of agreeing on that
particular set of tapes, then I think that you've got some pretty solid
evidence that the interview is working.
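
One simple way to picture the calibration Oller proposes is to compare an interviewer's ratings of the stored tapes with the ratings already assigned to them. The sketch below is an assumption-laden illustration: the half-point tolerance is borrowed from the article's remark that independent judgments rarely vary by more than a half-point, the tape identifiers and ratings are hypothetical, and the discussion does not specify any particular agreement statistic.

```python
def agreement_rate(reference, judged, tolerance=0.5):
    """Share of reference tapes on which a rater is within `tolerance`.

    `reference` maps tape identifiers to the ratings on file;
    `judged` maps the same identifiers to one rater's ratings.
    """
    if not reference:
        raise ValueError("no reference tapes")
    missing = [tape for tape in reference if tape not in judged]
    if missing:
        raise ValueError(f"rater has not judged: {missing}")
    hits = sum(
        1 for tape, rating in reference.items()
        if abs(judged[tape] - rating) <= tolerance
    )
    return hits / len(reference)


# Hypothetical tapes and ratings (a plus is counted as a half point).
reference = {"tape_01": 1.0, "tape_02": 2.5, "tape_03": 3.0, "tape_04": 4.0}
trainee = {"tape_01": 1.5, "tape_02": 2.5, "tape_03": 4.0, "tape_04": 4.0}
print(agreement_rate(reference, trainee))
```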

Wilds: That works in the case of the more commonly tested languages, but
it just isn't available for languages where fewer than 30 tests are
given a year, which may reflect only six levels of proficiency.

Spolsky: I think that the question that Professor Lado raised earlier
about the validity of an interview is a very good one, because one can
ask whether or not an interview is valid for more than performance in
an interview. That is, to what extent does performance in an interview
predict performance in other kinds of real-life situations. From a
sociolinguistic viewpoint, one can define a whole group of situations in
which people are expected to perform, interacting with different kinds
of subjects, speaking to different kinds of people about different kinds
of topics. The question can be raised to what extent an interview and a
conversation can sample all of these situations. I raised that question
before, when talking about the work Tucker has done, where he has
defined specific communication situations. Perhaps I could raise it
again from this question. To what extent have there been studies of
the accuracy of judgments made on the basis of FSI interviews? To what
extent is there follow-up work, to what extent is there feed-back, when
examinees go out into a real-world situation? Is there any way of
finding out how accurately these judgments work out?

Wilds: This has not been systematically examined as far as I know.
Certainly not recently. When we used to have regional language
supervisors visiting embassies overseas, there were checks of sorts.
Mostly the feed-back has been silence. Occasionally supervisors have
said, "You've been unjust and should have given a higher rating to
someone that you've underrated." But there hasn't been a systematic
study made, for example, by following someone around all day in his job.

Spolsky: In other words, what you'd get would be complaints, and these
complaints would depend presumably on whether a language-essential job
is in fact a language-essential job. If somebody who has been rated on
one of these things could move into what is described as a
language-essential job, but is not required to use it a great deal,
there would be no complaint.

Frey: I'm wondering if the oral interview is an effective way of testing
grammar and vocabulary. Can't we do a better job by paper and pencil
tests?

Wilds: If you want that kind of separate information. The question is
whether it would supply information that is useful to people as far as
proficiency on the job goes or as far as going into training.

Frey: Are you testing some other type of vocabulary and grammar, then? I
always thought that there was just one type of grammar and vocabulary. I
notice you have given grammar a weighting of 3 and vocabulary a
weighting of 2. That's a very high weighting for the oral interview. And
if one comes out very high in these, does that mean he can communicate?
Someone can communicate very well while still having many grammatical
errors in his speech.

Wilds: But if you can't put words together and don't have any vocabulary
to put together, you can't communicate.

Oller: Along that line, do you know what the correlation is between the
different scales? I frankly don't believe the difference between grammar
and vocabulary on tests. I would expect those to be very highly
intercorrelated.

Wilds: I think they are. I'd like to reiterate that the checklist does
not normally determine the grade. It's supportive evidence, and it's
relatively rarely calculated. It simply provides the testing unit with a
profile. Usually at FSI the examiner takes notes on the performance and
will report to the examinee, if he is interested, where his weaknesses
are. But it's not the determining factor.

Davies: Could I ask a different sort of question which relates to your
comment about the acceptance of social dialects? It seems to me very
sensible. I wonder whether you have any experience with dialects of a
different nature, for example, geographical dialects, whether you have
the same attitude toward them, or how you handle what we might call
"age-related" dialects, in the sense of how young people now speak?

Wilds: Except at the highest levels of the scale, this probably is not
important. Somebody who is up through a 3+ is not likely to make that a
problem for the examiner. He would look more like other 3's or 2+'s
than he would like the native speaker of a particular age group.




Cartier: Following up on a couple of things that Professor Spolsky said
a moment ago. First of all, what Spolsky wants to do that we're not
doing is to make a distinction between whether language is the problem
or whether language is the solution to a problem. Being a communication
man rather than a linguist, I tend to side with Spolsky on this. The
problem is communication, the solution is language, or a partial
solution is language. And what Spolsky wants to do is to assure that the
measures that we make, whereby we're going to predict the operational
capability of a man on the job, are concerned with his ability to
communicate and cope with real-life behaviors, regardless of whether he
is linguistically qualified. And let me point out that without at least
metric access to the criterion situation, we have what we must call a
surrogate criterion. We would like to, for example, correlate paper and
pencil tests with interviews, and the reason we would like to do that is
that the interviewer gives us this kind of surrogate criterion which we
have to use simply because we can't apply any sort of metric to the
criterion population and situation. I have another point to make about
the problem of the interview technique as a measure. You will recall
that Miss Wilds said that the people that give these interviews are
instructors in the language, professional linguists and so forth. In
this regard Sydney Sako and I had an interesting experience a couple of
years ago when we were asked to develop an oral proficiency test in
Vietnamese. Since Sydney and I have no knowledge of Vietnamese
whatsoever, we had to go to the Vietnamese faculty at DLI and have them
construct some sentences and dialogues to certain specifications for
us. A Vietnamese Air Force Captain who was working there was approached
and said that he'd be perfectly willing to make recordings of these. He
went to the studio, and about 20 minutes later he came back, and he
said, "I apologize, but I am unable to make these recordings for you." I
said "What's the matter, is it bad Vietnamese?" And he said, "Oh no!
It's superb Vietnamese, but it is not the way pilots talk. It's the way
teachers talk." One of the problems with the interviews is that they are
being given by the wrong people. This problem of whether you are going
to rate a man down because his grammar is bad or not keeps coming up
all the time. I want to find out, can the man cope? I don't care how
bad his grammar is, unless there are situations where the social
acceptableness of his language does become a factor.

Swift: I just wanted to comment on Professor Oller's question concerning
the correlation of grammar and vocabulary. We have observed over the
years something that we facetiously call the Peace Corps Syndrome, but
it applies to almost any person who comes to be tested, whose formal
training has been relatively short, and whose exposure to the language
in the field has been comparatively long. I would say there is here a
distinct non-correlation between grammar and vocabulary, with the
possibility of a wide range of vocabulary used in a very minimal set of
grammatical structures. And it is frequently quite good communication.
This sometimes raises the problem of whether we are going to apply the
same standards in terms of weighting the grammar for this kind of test
if what we're really trying to test is communicative ability.

Oller: All I can say is I agree with what Fran Cartier said, and with Spolsky's
arguments along those lines. I'm doubtful about the research behind the
comment on the lack of correlation between grammar and vocabulary. I think if
you have a good vocabulary test and a good grammar test, and if you give it
to typical populations of non-native speakers, you'll discover a very high
correlation, above the 80% level. And what this suggests to me is that what
linguists have encouraged us to believe as two separate functional systems,
lexis and grammar, are in fact a whole lot more closely related to some
underlying communicative competence. And my argument is that if you do
careful research on it, I expect you'll find that those five scales are very
closely related. We did a little bit of that at UCLA and discovered that they
were indistinguishable for practical purposes on scales of this sort. But that's
not published research, and I don't know of any other published research
which could be carefully examined and challenged.

Clark: I think quite a lot of the questions here deal with the problem of what
is the criterion on which the interview performances are to be rated or
evaluated. From my point of view I think the big selling point of the FSI
interview is that it permits judgments about the person's ability to do
certain things with the language in real-life terms, or at least portions of the
interview do. If you look at the scoring system for the FSI, there's some
intermingling of competencies in the sense of ordering a meal, finding one's way
around, etc., and on the other hand, how much grammar he knows, what his
pronunciation is like, and so forth. If it could be possible to weed out the
strictly structural aspects of the FSI criteria and stick instead with operational
statements of what he can do, then I believe our problem is solved. We use
the face-to-face interview of the operational type, and then we correlate the
results of this with very highly diagnostic tests of vocabulary, grammar and
so forth, and we actually see empirically what the relationships are at
different levels of performance.

Spolsky: What we're doing here actually is criticizing the fact that the
interview test is not a direct measure but is an indirect measure of something else.
I think we can get a clearer view if we add the sociolinguistic dimension that
we're talking in. But if we're talking about the situations in which language is
going to be used, the conversation that comes up in the interview is only one
of those situations. It's clear that one would expect a good correlation
between performances in an interview and any other conversations with either
language teachers or people who speak like language teachers. But there's the
question of doing some of these other functions that could be different. The
other point I was going to mention here deals with the problem of correlation
between grammar, vocabulary, and performances of various kinds, which is,
I think, related to the point that John Oller makes in another paper that I
recently read, where he talks about the relevance of the language learning
history, and that people who learn a language in different contexts are likely
to be better at different parts of language. It is theoretically possible for two
people with a vocabulary of 10,000 words to have only, depending on the
language, 800 of those words in common. It's also going to be theoretically
possible that two people will get by in languages making quite distinct basic
errors in those languages, and will continue speaking the language for many
years still making quite different basic errors. There are certain things that
will happen overall that will average out. But when it comes to judging an 
individual, there's likely to be the effect of two different language learning 
pasts. I think a comparison of ex-Peace Corps volunteers with normal college 
foreign language majors would bring this point out extremely clearly. And 
then there is this whole question of the communication or sociolinguistic 
analysis of the kinds of predictions you want to make on the basis of the test. 
When one looks at that second picture, then I think you can argue that the 
interview test has to be dealt with also as an indirect measure, and one has 
to decide what is the direct measure against which to correlate it.

Tetrault: How do you combine a functional evaluation with a check of
specific points of structure? How do you elicit points of structure?

Wilds: For example, eliciting a subjunctive that hasn't occurred naturally
might happen in an interpreting situation, where you have the examinee ask
the other examiner, "He wants to know if it's possible for you to come back
later." So that if at all possible all structural elements are elicited in the
context of some functional problem.

Tetrault: I assume then you'd have to, in some cases, elicit it from English
rather than the language.

Wilds: That's right, and that's why I think there's an advantage in having one 
examiner who speaks English natively. He can set up a situation in a very 
natural context. We never require formal interpreting; it's never set up to be 
a word-for-word thing. 



Testing Communicative Competence in Listening 
Comprehension 

Peter J. M. Groot 



1.0. Introduction. Foreign language needs in present-day society have
changed greatly during the past 20-30 years. Nowadays much more 
importance is attached to the ability to speak and understand a for- 
eign language because many more contacts with members of other 
linguistic communities take place through what is sometimes called 
the phonic layer of language, i.e. listening and speaking (telephone, 
television, radio, stays abroad for business and/or recreational pur- 
poses, etc.). Changes in foreign language needs accordingly must be 
reflected in foreign language teaching and testing. This paper gives a 
rough description of the development of listening comprehension 
tests to be administered to final year students in some types of 
secondary schools in Holland. Its purpose is to serve as an example 
of how tests of communicative ability should be developed, whether 
it be in a school situation or during a language training program for 
students who are going to work abroad. Of course, the specific aims of 
the various educational situations will differ, but the principles under- 
lying the construction of reliable, valid and economical tests largely 
remain the same. 

In 1969 the Ministry of Education asked the Institute of Applied 
Linguistics of the University of Utrecht to develop listening comprehension
tests to be introduced as part of experimental modern foreign
language exams (French, German and English) administered at
some types of secondary schools. 

1.1. Organization. On the basis of an estimate of the activities to be
carried out, a team was formed consisting of one teacher of French,
one teacher of German, one teacher of English, some project assistants
and a director of research.

1.2. Research plan. The research plan to be followed would roughly
comprise three stages: (a) Formulation of an objective for listening
comprehension of French, German and English, with interpretation
of the term listening comprehension, and formulation of a listening
comprehension objective on the basis of (a) and (b); (b) Operationalisation
of the listening comprehension objective; (c) Validating
the operationalisation.



2.0. Formulation of the Objective. The question whether a test is 
valid cannot be answered if one does not know the objective the test 
is supposed to measure. Hence, the first stage will have to be the
formulation of the objective that should be tested. The official objec- 
tives for the teaching of modern languages in the Netherlands, as laid 
down in official documents, are extremely vague or nonexistent. 

Abroad, some attempts have been made to formulate objectives for
modern languages but, if listening comprehension is separately specified
at all, either the formulation of the objective is much lacking in
explicitness or the objective is not relevant to the situation in Holland.
As a result, it is not surprising that there are many interpretations of
the term listening comprehension being applied in current teaching
practice. The first step to be taken in formulating an objective, then,
will be to give an interpretation of the term listening comprehension.

2.1. Interpretation of the term listening comprehension. The two
guiding principles in formulating any educational objective will be
utility (desirability) and feasibility. In interpreting the term listening
comprehension, therefore, it is necessary to give an interpretation
that is both useful and feasible.

How does one arrive at such an interpretation? Our starting point
is the premise that the primary function of language is communication,
i.e. the transmission and reception of information. The foreign
language teacher's task, then, is to teach his pupils to communicate in
the foreign language. Consequently, these objectives will have to be
descriptions of (levels of) communicative ability.

If we now turn to current listening comprehension teaching and
testing practice, we find that it is very often based on interpretations 
that result in teaching and testing skills, such as dictation or sound 
discrimination, that cannot properly be called communicative abili- 
ties. These may be useful activities during the learning process, but 
they can hardly be said to constitute communicative abilities in any 
useful sense of the word. A useful interpretation of the term listening 
comprehension will thus have a strong communicative bias; in other
words, its general meaning will be picking up the information (the
auditory messages encoded) from presented language samples.

2.2. Determining the listening comprehension level. The interpreta- 
tion given in 2.1 to the term listening comprehension was used to 
construct a test with open-ended questions consisting of language 
samples that were selected as to their difficulty level on mainly intui- 
tive grounds from a number of sources. The questions measured 
whether the most important information had been understood. These 
tests (one each for French. German and English) were administered 
to some 150 pupils divided among 4 schools. The scores provided
evidence in connection with the degree of difficulty of the language
samples that the pupils of the 25 schools taking part in the project
could be expected to handle, in other words, what would be feasible.

2.3. Formulation of the listening comprehension objective. Ideally,
the process of formulating objectives for the four language skills
(listening, speaking, reading, writing) will pass through five stages:
(1) Interpreting the terms listening, speaking, reading, writing, i.e.
defining the nature of the skill; (2) Making a typology of the situations
in which the students will have to use the foreign language after their
school or training period and determining how they will have to use
it (receptively and/or productively, written and/or orally); (3) Determining
the "linguistic content" of the situations referred to under
(2); (4) Determining what is feasible in the school situation; (5) Formulating
objectives on the basis of (1) through (4). Much of the work
mentioned under (2) and (3) remains to be done. It is therefore clear
that formulating a listening comprehension objective was not an
easy task.

Using the arguments and findings described in 2.1 and 2.2, the following
objective was formulated: The ability to understand English/French/German
speech spontaneously produced, at normal conversational
tempo, by educated native speakers, containing only lexical
and syntactic elements that are also readily understandable to less 
educated native speakers (but avoiding elements of an extremely 
informal nature), and dealing with topics of general interest. 

2.3.1. Explanatory remarks and comments. The main reason for
explicitly defining the language to be understood as speech was the
fact that in language teaching written language receives enough emphasis
but spoken language is much neglected. Most people will
accept that the ability to understand spontaneously produced speech
is a desirable objective for French, German and English, one reason
being that it is a necessary condition for taking part in a conversation
in the foreign language. Now, the spoken language differs in many
respects from the written language, mainly because the time for
reflection while producing it is much more limited. For this reason,
if we want to make sure spoken language is taught and tested, it
should be mentioned explicitly in the objective.

A good language teaching objective should explicitly define the
language samples that can be put in the test used to measure whether
the pupils have (sufficiently) reached the objective. The above listening
comprehension objective falls short of this requirement. The
spontaneous speech of educated native speakers within the area as
defined by the objective will still vary widely as regards speech-rate;
lexical, idiomatic and syntactic characteristics; etc.

This means that the limitations mentioned in the objective are not
exact enough. To make them more explicit, many questions will have
to be answered first, questions such as the following:

• What is normal conversational tempo? We know that there is a 
large variety in speech-rate between individual native speakers. In a 
pilot study for English, for example, a range of 11-23 centiseconds per
syllable was found. 

• What are topics of general interest? The reason for taking up this 
specification in the objective was to avoid giving one section of the 
population an advantage over another. It is clear that this element in 
the objective does not apply to situations where the terminal language 
behavior aimed at by the course is much more specifiable. 

• What syntactic elements are readily understandable to less educated
(i.e. without a secondary school education) native speakers?
Very little is known about correlates between syntactic complexity
and perceptual difficulty. Psycholinguistic research (cf. Bever 1971)
has convincingly proved that there are correlates, but in most cases
this evidence was found in laboratory experiments with isolated
sentences. Even if the internal validity of these experiments is high,
the external validity is doubtful; in other words, it is questionable
how far these findings can be extrapolated to real-life situations.

• What is the effect of limiting the test to educated native speakers
(i.e. native speakers with at least a secondary school education)? 
Educated native speakers are referred to in the objective as a means 
of limiting the range of accents of the language samples that can be 
used in the test. 
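
The speech-rate range quoted in the first question above is easier to compare with other figures when it is expressed as syllables per second. The following is only an illustrative sketch of that conversion; the function name is hypothetical and nothing beyond the 11-23 centisecond range comes from the pilot study.

```python
def syllables_per_second(centiseconds_per_syllable: float) -> float:
    """Convert a mean syllable duration (in centiseconds) to a speech rate."""
    if centiseconds_per_syllable <= 0:
        raise ValueError("syllable duration must be positive")
    # One second contains 100 centiseconds.
    return 100.0 / centiseconds_per_syllable


# Range reported for English in the pilot study: 11-23 centiseconds per syllable.
fastest = syllables_per_second(11)
slowest = syllables_per_second(23)
print(f"{slowest:.1f} to {fastest:.1f} syllables per second")
# prints "4.3 to 9.1 syllables per second"
```

On these figures the fastest speakers produce roughly twice as many syllables per second as the slowest, which illustrates why "normal conversational tempo" needs further specification.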

Although answers to the above questions may never be completely 
satisfactory, the listening comprehension objective formulated in 2.3 
does give the teacher and student a much clearer view of what is 
expected after the secondary school period than did the formulations
referred to in 2.0. 

3.0. Operationalising the Objective. The fact that the listening comprehension
objective formulated in 2.3 is a compromise between what
is desirable and useful, on the one hand, and what is feasible, on the
other, has implications for the tests that can be considered good
operationalisations of the objective. These tests will have the nature of
both achievement tests (the feasibility aspect) and proficiency tests
(the desirability aspect). An achievement test measures knowledge,
insight and skills which the testees can be expected to demonstrate
on the basis of a syllabus that has been covered, while proficiency
tests measure knowledge, insight and skills, irrespective of a particular
syllabus.

Achievement tests are concerned with a past syllabus, while pro- 
ficiency tests are concerned with measuring abilities that the testee 
will have to demonstrate in the future. A test to be used for final
examination purposes will thus have the character of both an achievement
and a proficiency test: in other words, it will test what has been
learned and what "should" have been learned.

Apart from the above arguments, there is also another, more pragmatic
argument to defend final (language) exams having this hybrid
character. They could not be achievement tests only, since, in schools
where the tests are given, the syllabi vary depending on what textbooks
and other course material (readers, articles, etc.) the individual
teacher has chosen. One of the consequences of the hybrid nature of
the tests operationalising the listening comprehension objective is the
fact that teachers cannot restrict themselves to training their students
in a particular syllabus. Also, they will have to give proper training in
the (behavioural) skills specified in the objective.

3.1. In order to produce a reliable, valid and economical operationalisation
of the listening comprehension objective the following
demands had to be met in constructing the tests.

3.1.1. The questions in the test should measure whether testees have
listened with understanding to the language samples presented. They
should not measure knowledge of a particular lexical or syntactic
element from the sample, since understanding the sample need not
be equivalent to knowing every element in it. Ideally, the semantic
essence of the language sample constitutes the correct answer to the
test question.

If we want the test to be valid, it is essential for the questions to
measure global comprehension of the samples. How this global comprehension
is arrived at is largely unknown, because we have no adequate
analysis of listening comprehension at our disposal. We know
little of the components of listening comprehension and even less of
their relative importance. Of course, one can safely say that knowledge
of the vocabulary, syntax and phonology of the target language
are important factors. Most language tests limit themselves to measuring
these components, but most of the evidence accumulated in
recent testing research corroborates the statement that communicative
competence (i.e. the ability to handle language as a means of communication)
is more than the sum of its linguistic components. For
that reason a test of listening comprehension, as described in the
objective, cannot be valid if it only measures the testee's command of
the (supposed) linguistic components, since its validity correlates
with the extent to which it measures the whole construct: both the
linguistic and non-linguistic components of listening comprehension.

3.1.2. Since the language samples to be used in the test have to be
bits of spontaneous speech, they must be selected from some form
of conversation (dialogue, group-discussion, interview, etc.). To ensure
this, the samples were selected from recordings of talks between
native speakers.




3.1.3. In some real-life listening situations (radio, television, films)
the listener will not be in a position to have the message repeated. In
other such situations (e.g. conversation), this possibility does exist,
but an excessive reliance on it indicates deficient listening comprehension.
For this reason (the validity of the test as a proficiency test),
it was decided to present the auditive stimuli once only.

3.1.4. Although memory, both short and long term, plays an important
role in the listening comprehension process, it should not be
heavily taxed in a test of foreign language listening comprehension,
when it is familiarity with the foreign language that should be primarily
measured. To safeguard this, the length of the language samples
was restricted to 30-45 seconds.

3.1.5. Similarly, to ensure that the foreign language listening comprehension
test does not excessively emphasize the reasoning component,
the concepts presented in the samples should be relatively
easy. This can be checked by presenting them to native speakers some
two or three years younger than the target population (cf. also 5.3.5).

3.1.6. Since the tests were to be used on large populations, distributed
among many schools and teachers, the test questions were of the
multiple-choice type. It was found that items consisting of a stem plus
three alternatives, instead of the usual four, were most practical.

3.1.7. The multiple-choice questions should be presented in a written
form. If they are presented auditorily, test scores may be negatively
influenced by the fact that testees fail to understand the questions.

3.1.8. In order to standardise the acoustic conditions of presentation,
it was decided to administer the tests in language laboratories to
eliminate, as much as possible, sources of interfering noise, both in 
and out of the classroom. 

3.2. Description of the test. Testees, seated in a language laboratory
with individual headphones, listen to taped interviews or discussions,
which are split up into passages of about 30-45 seconds. Following
each passage there is a twenty-second pause in which testees answer
on a separate answer sheet a multiple-choice global comprehension
question. The test consists of fifty items, takes approximately one
hour and comprises three different interviews or discussions. After
each part of sixteen to seventeen items, there is a break of at least ten
minutes. The tests are pretested on native speakers and Dutch pupils.
After item analysis, the final form is administered to the target
population in two "scrambled" versions to avoid cheating. The following
are examples taken from the examination listening comprehension
tests for 1972.



French 



Question: Est-ce que le whisky est un concurrent pour les boissons
françaises?

Response: Vous savez que le whisky a été une des boissons qui
s'est le plus développées dans les pays du continent depuis
quelques années, c'est devenu une boisson à la mode. Il est
certain que cette nouvelle mode a été un concurrent pour
certains produits traditionnels français ... certains apéritifs,
certains vins, peut-être même nos spiritueux.

Item: Est-ce que le whisky est un concurrent pour les boissons
françaises, selon M.J.?

A Non, parce que boire du whisky est une mode qui passera.
B Non, parce que le whisky diffère trop des boissons françaises.
C Oui, parce que le whisky a beaucoup de succès actuellement.

English 

Question: Talking about newspapers, what do you object to in
the presentation of news?

Answer: What I strongly depreciate is an intermingling of news
with editorial comment. Editorial comment's terribly er... easy
to do, but news and facts are sacred and should be kept at all
times quite, quite distinct. I think it's very wrong and you have
this in so many newspapers where the editorial complexion or
the political complexion of the newspaper determines its
presentation of facts, emphasizing what they consider should be
emphasized and not emphasizing unhappy facts which conflict
with their particular point of view.

Item: What does Mr. Ellison Davis object to in some newspapers?

A That the way they present their news is too complex.
B That the editor presents his opinions as news items.
C That their presentation of facts is influenced by editorial
views.

German 

Frage: Frau K., Sie sind nun berufstätig. Was denken Sie über
die berufstätige Frau mit kleinen Kindern?

Antwort: Da müsste ihr natürlich der Staat sehr viel helfen. Hat
diese Frau Kinder, dann muss ihr die Möglichkeit geboten
werden, das Kind in einen Kindergarten stecken zu können,
der a. gut ist, d.h. eine Kindergärtnerin muss für kleine Gruppen
da sein, und der den ganzen Tag offen ist, dass sie nicht
mittags schnell nach Hause laufen muss um zu sehen, was nun
das Kind macht. Ehm, dann ist es wohl möglich, dass sie auch
während der Ehe berufstätig ist. Vorausgesetzt natürlich, dass
auch der Mann diese Möglichkeit akzeptiert.

Item: Was denkt Frau K. über eine berufstätige Frau mit kleinen
Kindern?

A Der Staat sollte ihr das Arbeiten ermöglichen.
B Die Meinung des Mannes verhindert die Berufstätigkeit
vieler Frauen.
C Nur morgens sollte sie arbeiten, mittags sollte sie für die
Kinder da sein.



4.0. Reliability. Since 1969 many listening comprehension tests of
the kind described in 3.2 have been constructed and administered.
The reliability of the tests, as calculated with the Kuder-Richardson 21
formula, ranged from .70 to .80. Taking into account the complexity of
the skill measured, these figures can be considered satisfactory. Indeed,
it remains to be seen whether listening comprehension tests of
this kind can be constructed that show higher reliability coefficients.
If not, one of the implications could be that, in calculating correlation
coefficients of these tests with other tests, correction for attenuation
cannot be applied.
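
For readers unfamiliar with the two statistics named in this paragraph, a minimal sketch follows. It assumes the standard textbook forms: KR-21 computed from the number of items, the mean and the standard deviation of the raw scores, and the classical correction for attenuation, which divides an observed correlation by the square root of the product of the two reliabilities. The function names and the figures in the usage lines are hypothetical and are not taken from the project data.

```python
import math


def kr21(n_items: int, mean: float, sd: float) -> float:
    """Kuder-Richardson formula 21, from raw-score mean and standard deviation.

    KR-21 treats all items as equally difficult, so it needs only summary
    statistics rather than item-level data.
    """
    if n_items < 2 or sd <= 0:
        raise ValueError("need at least two items and a positive SD")
    variance = sd ** 2
    return (n_items / (n_items - 1)) * (
        1 - mean * (n_items - mean) / (n_items * variance)
    )


def correct_for_attenuation(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Estimate the correlation between true scores on two tests."""
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical figures for illustration only.
print(round(kr21(50, 40.0, 6.0), 2))                         # 0.79
print(round(correct_for_attenuation(0.60, 0.75, 0.80), 2))   # 0.77
```

The second line of output shows why reliability matters for validation work: an observed correlation of .60 between two tests with reliabilities of .75 and .80 corresponds to an estimated true-score correlation of about .77.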

In general, the listening comprehension tests for French show the 
highest reliability and standard deviations, the tests for German show 
the lowest and the English tests take a middle position. The figures for 
the 1972 French listening comprehension test shown below may be
considered representative for most of the psychometric indices of the 
listening comprehension tests administered. 

Results, English listening comprehension test, 1972

Number of testees      840
Mean score             80%
Standard deviation     11.82
Reliability (KR-21)    .77

5.0. Validity. The listening comprehension objective formulated in
2.3 considerably limits the number of valid operationalisations, but it
still allows for more than one.

We chose the operationalisation described in 3.2 because it best
meets both validity and educational requirements. Various questions
in connection with its validity can be raised, however. Should the
multiple-choice questions be presented before or after listening to
the passage? Does the fact that multiple-choice questions are put in
the target language affect the scores? Is the use in the distractors of
the multiple-choice question of a word (or words) taken from the
passage a valid technique? Should the testees be allowed to make
notes? How long should the passages (items) of the test be?

The last two questions have been dealt with during discussions with 
the teachers taking part in the experiment on the basis of their 
experiences in administering the tests. It was not considered advisable 
to allow the testees to make notes while listening because this would 
decrease the attention given to listening. The length of the passages 
should not exceed 45 seconds in connection with concentration prob- 
lems (cf. 3.1). 

The first three questions have been dealt with in experiments of the
following type: a control group and an experimental group were
formed, which were comparable as to listening comprehension on the
basis of scores on previous listening comprehension tests (equal mean
score, standard deviation, etc.). The two groups took the same test in
two forms, the difference being the variable to be investigated. The
results of experiments carried out in this stage are given below.

Experiment 1

Variable: multiple-choice questions before listening to the passage.

Control group (85 testees)        Questions after    71%
Experimental group (85 testees)   Questions before   72%

These data were discussed with the teachers taking part in the 
experiment, and it was decided to present the questions before 
listening to the passages. The general feeling was that this tech- 
nique made the listening activity required more natural and life- 
like, because it enabled the testees to listen selectively. 

Experiment 2

Variable: multiple-choice questions in mother tongue.

Control group (120 testees)        Questions in foreign language   77%
Experimental group (120 testees)   Questions in mother tongue      82%

During discussions with the teachers, it was decided to present
the questions in the foreign language because the pupils preferred
it and the difference in the mean scores of the two groups
was relatively small.

Experiment 3: Echoic elements

In this experiment the object was to determine the effect of using
so-called "echoic" elements in the alternatives of the multiple-choice
questions. (Echoic elements are words, taken from the
passage, that are used in the alternatives.) A twenty-item test was
constructed so that the correct alternatives of the items contained
hardly any echoic elements; one distractor contained
echoic elements, one did not. This test was administered to a
group of eighty pupils who had taken other listening comprehension
tests. The item analysis of the scores showed an average
discrimination value of .41. From this the conclusion was drawn
that the use of echoic elements in the distractors (and sometimes
in the correct alternative, of course) is indeed a good technique
to separate poor from good listeners.
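
The text does not say which discrimination index was used in the item analysis. One common choice is the point-biserial correlation between getting an item right and the total test score, and the sketch below assumes that index purely for illustration. The function name and the six-testee data set are hypothetical.

```python
from statistics import mean, pstdev


def point_biserial(item_correct: list[int], total_scores: list[float]) -> float:
    """Point-biserial correlation between a 0/1 item score and the total score.

    Values near zero mean the item does not separate strong from weak
    testees; higher positive values mean it discriminates better.
    """
    if len(item_correct) != len(total_scores):
        raise ValueError("both lists must describe the same testees")
    p = mean(item_correct)      # proportion of testees answering correctly
    sd = pstdev(total_scores)   # population SD of the total scores
    if p in (0, 1) or sd == 0:
        raise ValueError("index is undefined without variation")
    mean_right = mean(t for c, t in zip(item_correct, total_scores) if c == 1)
    mean_wrong = mean(t for c, t in zip(item_correct, total_scores) if c == 0)
    return (mean_right - mean_wrong) / sd * (p * (1 - p)) ** 0.5


# Hypothetical data: six testees, one item, total scores out of 20.
item = [1, 1, 1, 0, 0, 0]
totals = [18, 16, 14, 12, 10, 8]
print(round(point_biserial(item, totals), 2))  # 0.88
```

An average value over all items, such as the .41 reported above, would be obtained by computing the index for each item and taking the mean.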

5.1. After evaluating the outcome of the experiments and discussions
referred to in 5.0, proper validation of the tests in their final
form could start. The tests that were validated were the examination
tests of 1971 and 1972.

Following Cronbach's (1966) division, I shall deal with content
validity, concurrent validity and construct validity.

5.2. Content validity. What was said in 3.0 about the nature of these
tests (partly achievement, partly proficiency) implies that in establishing
their content validity there are two questions that have to be dealt
with. (1) To what extent do the tests adequately sample the common
core of the instructional syllabus the target population has covered?
(2) To what extent do the tests adequately sample the listening proficiency
described in behavioural terms in the objective?

5.2.1. As regards the first question the intuitions, based on teaching
experience, of the members of the test construction team about the
common core of the syllabi covered by various schools taking part
proved to be highly reliable. Evidence for the reliability was acquired
during discussions with the teachers on the tests administered where,
only rarely, lexical or syntactic elements in the tests were objected to
as being unfair. In this context one has to bear in mind that answering
the global comprehension questions correctly does not depend on
knowledge of each and every lexical and/or syntactic element (cf.
3.0).

5.2.2. A much more complicated problem is posed by the second
question. In the objective, the level of listening comprehension
expected of the pupils is described in functional terms. It is an attempt
to specify in what sociolinguistic context pupils are expected to
behave adequately (i.e. understand the message). It does not give a
detailed linguistic description of these sociolinguistic situations. Some
linguistic characteristics are given (in connection with lexical and
syntactic aspects), but these are not very precise. One of the
consequences is that it will be impossible to claim a high content validity
in linguistic terms of a test operationalising the objective. Claims in
connection with content validity will have to be based on evidence
concerning the representativity of the situations in the test for the
universe of situations described in the objective. For this reason, the
TOLC-tests are rather long (fifty items dealing with a range of topics).
We are confident that these tests form a representative selection of
the situations defined in the objective.

5.3. Concurrent validity. The concurrent validity of the TOLC-tests 
was investigated in various experiments. 

5.3.1. One of the assumptions in connection with the TOLC-tests is 
that pupils who score high on these tests will understand French, 
German and English on radio and television better than pupils who 
score lower. To find out whether this assumption was warranted, a 
test was constructed consisting of language samples selected from the 
above sources. This test was administered to a population of pupils 
that also took a selection of the 1972 TOLC-test. 

Results 

Selection '72 test: 'M) items     Number of pupils: 120 
"Radio-test": 'M) items           (p.m.) correlation: .67 

5.3.2. The 1971 and '72 TOLC-tests were correlated with teacher 
ratings. Some teachers were asked to rate their pupils' listening 
comprehension on a four-point scale. These pupils also took a TOLC-test, 
and the teacher ratings were correlated with the scores on the test. 
The p.m. correlations ranged from .20 to .70. (This lack of 
consistency may be explained by the fact that listening comprehension 
as worked out in the TOLC-tests was a relatively new skill to the 
teachers and hence hard to evaluate.) 
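The "p.m." figures reported in 5.3.1 and 5.3.2 are product-moment (Pearson) correlations between two sets of scores obtained from the same pupils. As a minimal sketch of how such a coefficient is computed, the function below works from paired score lists. The function name and the six pairs of scores are hypothetical and serve only to illustrate the calculation; they are not data from the experiments.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two paired score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for six pupils (not data from the study).
tolc_scores = [12, 18, 20, 25, 27, 30]
radio_scores = [10, 15, 22, 21, 28, 29]
print(round(pearson_r(tolc_scores, radio_scores), 2))
```

A coefficient near 1 indicates that pupils are ranked almost identically by the two measures; the wide spread of the teacher-rating correlations (.20 to .70) is what one would expect when one of the two measures is applied inconsistently.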

5.3.3. At the request of some of the teachers taking part in the 
project, an experiment was carried out to determine whether the fact 
that the test questions were of the multiple-choice type influenced the 
scores in such a way that it would invalidate the test. For this purpose 
the 1971 and '72 TOLC-tests were used. These two tests can be taken 
to be of a comparable degree of difficulty, as witnessed by the scores 
of the populations who had taken them as exam tests. 

Results 



Selection '71: 40 items 
Selection '72 (open questions) 
Max. score 00: 00 items 
Number of pupils: 90 
(p.m.) correlation: .68 



Also, the selection of the 1971 test was correlated with the selection 
of the 1972 test, presented without any questions. The pupils 
had to give a summary of the passages listened to. The correlation 
for English was .69. 

5.3.4. The 1971 TOLC-test was correlated with listening comprehension 
tests of the Swedish Department of Education. These 
tests were similar to the TOLC-tests as far as the presentation of the 
stimuli and questions were concerned. The language samples were 
different (e.g. not spontaneously produced speech). 

Results 

Selection TOLC-test '71: 36 items     Number of pupils: 73 
Swedish test: 29 items                (p.m.) correlation: .64 

5.3.5. Also, the scores on the foreign language listening comprehension 
tests have been compared with scores on equivalent comprehension 
tests in the mother tongue. While the foreign language test scores 
averaged 70% for French, 74% for English and 76% for German, the 
Dutch test scores averaged 88%. The fact that mother-tongue listening 
comprehension on higher levels is not perfect has already been 
shown by other studies (Spearrit 1962; Nichols 1957; Wilkinson 1968). 
It is reassuring to know, however, that listening efficiency can be 
improved through relatively simple training procedures (Erickson 1954). 

5.3.6. Construct validity. In order to find out more about the 
construct that had apparently been measured, the TOLC-test scores were 
correlated with scores on tests of various hypothesised components 
of listening comprehension. The prediction was that scores on the tests 
of linguistic components (vocabulary, grammar, phonology) would 
correlate higher with scores on the global listening comprehension 
tests than would scores on tests of non- or para-linguistic components, 
such as memory, intelligence, etc. 

The linguistic components were tested in various ways (e.g. 
vocabulary was tested by means of tests presenting isolated words and 
tests presenting contextualised words). The non-linguistic components 
were tested by means of standardised tests (Raven's Advanced 
Progressive Matrices Set I, Auditory Letter Span Test MS-3, etc.). The 
results showed that the correlations between the non-linguistic 
subtests and the global listening comprehension tests were indeed much 
lower (ranging from .07 to .23) than the correlations between the 
linguistic subtests and the global listening comprehension tests 
(ranging from .40 to .69). 

6.0. Concluding Remarks. I have intentionally refrained from giving 
a detailed analysis of the above data or suggesting directions for 
further research into listening comprehension. My main point has 
been to demonstrate in what way tests can be a vital part of research 
into communicative competence. Further research might take the 
form of comparing the overall listening comprehension test scores 
with results of cloze tests using reduced redundancy. It might take the 
form of factor analysis of the hundreds of listening comprehension 
test items that have been administered to see whether any prominent 
factors emerge and, if so, whether they can be interpreted as parts of 
a meaningful linguistic or psychological framework. But whatever 
form it takes, it will be a disciplined activity, testing hypotheses 
concerning communicative competence conceived on the solid basis of 
reliable empirical evidence, a basis that seems to be sadly lacking in 
much research on language learning, resulting in fashionably 
exchanging one ill-founded opinion for another. 

NOTES 

1. The phonic layer as opposed to the graphic layer (reading and writing), which was 
much more the mode of linguistic communication some decades ago. 

2. It is not implied here that this is the only condition to be fulfilled in order to 
establish the validity of a test! 

3. Suffice it to mention syntactic irregularities, different choice of words, speech 
errors, hesitation pauses, etc. 

4. Whether these subjectively chosen materials do cover the most frequent and useful 
words is open to doubt. To remedy this, an attempt will be made to produce, for the 
various types of schools, lists of words that have more objectively been proved to 
be useful for secondary school pupils to master. 

5. Some to be induced from the objective, others to be added on pragmatic grounds. 
The list of demands under 3.1 is by no means exhaustive. It only gives the conditions 
these particular tests had to fulfill. It does not specify the general requirements any 
good test has to satisfy. 

6. For the sake of brevity, the figures for the English test are given, as the figures for 
the German and French tests yielded very much the same patterns. 

7. Even if the objective did give a detailed linguistic description, it would be difficult 
to establish content validity for a test operationalising it. This is a general problem 
applicable to all language tests. The root of this problem lies in the generative 
character of natural language. The rules governing a language are such that an infinite 
number of possible applications and combinations can be generated by speakers 
of that language. Consequently it will be difficult to determine whether the content 
of a test constitutes a representative selection of the possible applications and 
combinations. 



DISCUSSION 

Clark: I notice that the one example item that is given is a three-option item. 
It would be relatively easy to make a fourth or even a fifth option for the 
item, which, I think, would increase the reliability of the test. We've tried 
somewhat the same thing at ETS where two or three native language speakers 
recorded about two or three minutes on topics like pollution, Watergate, and 
so forth; then multiple-choice questions were asked on the conversation. 
The reading difficulty problem was overcome by having the questions in the 
students' native language. We've found that this type of real-life 
conversational listening comprehension is much more difficult than the discrete item. 
Spolsky: Presumably because listening comprehension is the closest to 
underlying competence (or has the fewest kinds of other performance 
factors involved), it is least dependent on learning experience. With a certain 
amount of limited experience or exposure to a language, and a limited 
learning history that includes exposure to the language, listening comprehension 
is going to be the one that is closest to the most basic knowledge of a 
language. It's the first kind of thing that gets developed. It would be unusual 
to find somebody who is more proficient in speaking than in understanding. 
If we take a test of listening ability, one would expect to find it correlates 
more highly with almost every other test than anything else. 
Jones: The problem is, of course, to measure the ability. How do you know 
if the student understood or not? He's got to respond in some way, which 
means that you're only secondarily getting at the performance. 
Cartier: I think Spolsky is trying to get at the problem of the real world 
where we listen either for information or directions to do something. We 
process the information in various ways, and the resultant behavior may be 
immediate or it may be way off in the future. Ideally, we would like to be 
able to test each of those two things in their real operational time frames, so 
that if, for example, you're training aircraft mechanics you can tell them, "If 
you ever run across a situation where a spark plug is in such and such a 
condition, then do so and so." Then if, two or three weeks later, they run across 
such a spark plug, will they in fact do such and such? Here you've got the 
problem of listening comprehension, memory, and a whole raft of other kinds 
of things that are involved, but certainly listening comprehension is a very 
strong part of it. In an article of mine in the TESOL Quarterly some time back, 
I reported on criterion-referenced testing which used some surrogate criteria 
in reference to taking directions. For example, you make a tape recording 
which says in English "Go and get a 7/16th wrench." In the testing room you 
have a tool box in which there are a whole bunch of tools, including a 7/16th 
wrench, and these have numbers on them. The examinee goes to the tool box, 
picks out the proper things, takes the numbers off, and writes them on his 
answer sheet. The person has to exhibit the behavior you actually record. 
Nickel: I'm interested in Spolsky's question concerning the correlation 
between listening and speaking. From my own experience, I don't exclude a 
certain percentage of learner types who have a greater competence in 
speaking than in listening, especially if two factors are present: one, if the topic 
of discussion is not familiar to the examinee, and two, if the accent is changed, 
for example a change from a British to an American accent. 
Spolsky: I'm still trying to get at the point of overall proficiency. I'm 
convinced that there is such a thing. Even taking the accent or the style question, 
presumably there'd be very few cases where people will develop productive 
control of several styles before they develop receptive control of a wider 
range of styles. 




Reduced Redundancy Testing: A Progress Report 

Harry L. Gradman and Bernard Spolsky 



In an earlier paper (Spolsky et al., 1968), some preliminary studies 
were reported of one technique for testing overall second language 
proficiency, the use of a form of dictation test with added noise. The 
purpose of this paper is to reconsider some of the notions of that paper 
in the light of later research. 

The original hypothesis had two parts: the notion of overall 
proficiency and the value of the specific technique. The central question 
raised was how well the dictation test with added noise approximates 
functional tests with clear face validity. There was no suggestion that 
it could replace either tests of specific language abilities or various 
functional tests (such as the FSI interview [Jones, forthcoming] or 
other interview tests [Spolsky et al., 1972]). Research with the test 
came to have two parallel concerns: an interest in the theoretical 
implications of the technique, and a desire to investigate its practical 
value in given situations. 

The theoretical issues have now been quite fully discussed (Spolsky, 
1971; Oller, 1973; Gradman, 1973; Briere, 1969). Assuming the 
relevance of what Oller calls a grammar of expectancy, any reduction of 
redundancy will tend to increase a non-native's difficulty in 
functioning in a second language more than a native speaker's, exaggerating 
differences and permitting more precise measurement. The major 
techniques so far investigated for reducing redundancy have been 
written cloze tests (Oller, 1973; Darnell, 1970), oral cloze tests (Craker, 
1971), and dictation tests with (Spolsky et al., 1968; Whiteson, 1972; 
Johansson, 1973; Gradman, 1974) and without (Oller, 1971) additional 
distortion. In this paper, we will discuss some of the more recent 
studies of the dictation test with added distortion and will consider their 
theoretical and practical implications. 
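To make the idea of a written cloze test concrete, here is a minimal sketch that blanks out words at a fixed interval and keeps the deleted words as the answer key. Fixed-interval deletion is one widely used convention and is assumed here only for illustration; the cited studies may have selected and scored deletions differently. The function name, its parameters, and the sample passage are hypothetical.

```python
def make_cloze(text, every_nth=7, start=2):
    """Blank out every nth word, beginning at word index `start` (0-based).

    Returns the mutilated passage and the list of deleted words.
    """
    if every_nth < 1:
        raise ValueError("every_nth must be at least 1")
    words = text.split()
    passage = []
    answers = []
    for index, word in enumerate(words):
        if index >= start and (index - start) % every_nth == 0:
            answers.append(word)      # keep the deleted word for scoring
            passage.append("____")    # the examinee must restore this word
        else:
            passage.append(word)
    return " ".join(passage), answers


# Hypothetical sample passage.
sample = "one two three four five six seven eight nine ten"
cloze_text, key = make_cloze(sample, every_nth=3, start=1)
print(cloze_text)
print(key)
```

The fewer words that remain around each blank, the less redundancy the reader can draw on, which is the property the reduced redundancy argument relies on.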

The original study (Spolsky et al., 1968) described six experiments 
carried out in 1966 at Indiana University. In a preliminary experiment, 
fifty sentences from an aural comprehension test were prepared with 
added white noise. Six students were asked to write down what they 
heard. There was evidence of correlation between the score on this 
test and a comprehension score, and non-native speakers of English 
were clearly separated from natives, but the test seemed too hard: 
there were too many "tricks" in the sentences, and the signal-to-noise 
ratios were somewhat uncontrolled. In the second experiment, lack of 
control of signal-to-noise ratio and dissatisfaction with the sentences 
again caused concern. In the third preliminary study, sentence content 
continued to cause confusion, with certain sentences turning out to be 
easy or hard under any conditions. In the next experiment, the 
sentences were rewritten with an attempt made to control sentence 
structure. Following a then current hypothesis suggesting that sentence 
difficulty was related to the number of transformations undergone, 
sentences were written in which each sentence had the same number of 
words, all words were frequent (occurring at least once in every 3000 
words), and there were five sentences for each of ten structural 
descriptions. Groups of 5 sentences were chosen randomly with one 
sentence from each structural type, and appropriate noise was added. 
Attention in this experiment was focused on the possibility of 
learning: did the test get easier as the subject became more accustomed to 
the noise? By the end of this experiment, the learning question was 
not answered, but the problem of sentence construction was becoming 
clearer. It was obvious that sentence structure, semantic acceptability, 
word frequency, and phonological factors could all play a part 
besides the noise. At this stage, the effect of reversing the order of the 
signal-to-noise ratios was tried, and it was determined that learning 
effects could be discounted if the harder items came first. 
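The difficulty of controlling signal-to-noise ratios described above can be illustrated with a small digital sketch. The 1966 experiments mixed noise with tape recordings, so the code below is only a modern analogue under stated assumptions: the recording is a list of samples, the noise is Gaussian white noise, and the ratio is expressed in decibels. The function names and the test tone are hypothetical.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return the signal mixed with Gaussian white noise at the given SNR (dB)."""
    if not signal:
        raise ValueError("signal must not be empty")
    rng = random.Random(seed)
    signal_power = sum(s * s for s in signal) / len(signal)
    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Rescale the noise so that its measured power equals the target power.
    raw_power = sum(x * x for x in noise) / len(noise)
    scale = math.sqrt(noise_power / raw_power)
    return [s + scale * x for s, x in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """Recover the SNR in dB from a clean signal and its noisy version."""
    signal_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum((n - s) ** 2 for s, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical one-second 440 Hz tone sampled at 8000 Hz.
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
harder = add_white_noise(tone, snr_db=0.0)    # noise as strong as the signal
easier = add_white_noise(tone, snr_db=12.0)   # noise well below the signal
```

Lower ratios remove more of the acoustic redundancy; ordering items from the lowest ratio to the highest corresponds to the "harder items first" arrangement mentioned in the text.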

The next experiment was a trial of the instrument with 48 foreign 
students. Correlations were .66 with an aural comprehension test 
and .62 with an objective paper-and-pencil test, and .40 with an essay. 
But it still seemed too hard; the mixing remained a problem, and the 
phonological tricks added too much uncertain difficulty. It was 
realized that "the phonological 'trick' is itself a form of masking, leaving 
ambiguity to be clarified by redundant features. Consequently, the 
addition of acoustic distortion makes interference two-fold" (Spolsky 
et al., 1968, p. 94). It remained impossible to specify to what extent 
redundancy had been reduced. 

The final experiment in the 1966 series used a set of new sentences 
(without "tricks") with white noise added electronically. The test was 
given to 61 foreign students, and correlations of .66 with both the aural 
comprehension and the discrete item tests and .51 with the essay test 
resulted. The experiments were summarized as follows: 

These preliminary studies have encouraged us to believe that 
a test of a subject's ability to receive messages under varying 
conditions of distortion of the conducting medium is a good 
measure of his overall proficiency in a language, and that 
such a test can be easily constructed and administered. 
(Spolsky et al., 1968, p. 7) 








The techniques described in this first paper were tested further in a 
study reported by Whiteson (1972). Looking for a simple screening 
device for large numbers of foreign students, Whiteson prepared fifty 
different sentences on the same structural model as those described 
above, adding noise to them. The resulting test, which correlated at 
.54 with another proficiency measure, provided, she felt, evidence of 
being a good screening device, serving the purposes for which it was 
intended. 

In a somewhat ambitious study of the technique carried out over two 
years, Johansson investigated not only the overall effect of the test but 
studied in detail the characteristics of some students with whom it did 
not work as well. He developed a new form of the test with a number 
of basic changes: (1) the signal-to-noise ratios were lower, because 
his Swedish students were, he believed, better than the average 
foreign students in the Indiana studies; (2) there were fewer sentences; 
(3) the sentences were written with high redundancy (presumably 
balancing the effect of the lower signal-to-noise ratios); (4) elements were 
included that could be expected to cause difficulty for Swedes 
(supposedly on the basis of some sort of contrastive analysis); (5) the 
scoring system was changed; and (6) the difficulty order was reversed. 
With all these changes, and with the probability that the subjects were 
more homogeneous in English knowledge than those in the Indiana 
study, the test still showed a reasonably good correlation (.52) with a 
test that appears (as far as one can tell from the description) to have 
been a traditional test of knowledge of standard written English. 
Unfortunately, however, this latter test appears to have been unreliable. 
The dictation test also correlated well with a phoneme discrimination 
test. The rest of Johansson's study was concerned with those students 
for whom the dictation test fails to be a good predictor of academic 
success. Here he finds some evidence suggesting that there are certain 
kinds of students whose personality reacts to tests of this kind 
(whether because of noise alone or the general novelty) and for whom 
the results are therefore questionable. 

Johansson's study raises a number of interesting questions. Obviously, 
it would be desirable to know the effect of the various changes he 
made in the form of the test. And his somewhat extreme conclusions 
appear to be premature: a dictation test without noise but under any 
conditions of pressure is just as much a test of reduced redundancy 
as one with noise, so that the theoretical difference may be nil. 

In a somewhat more useful assessment of reduced redundancy tests, 
John Clark (forthcoming) suggests that they can be considered as one 
kind of indirect proficiency test. This classification is based on the fact 
that they do not need to reflect normal language use situations, but can 
be justified by other kinds of validity besides face validity. He feels 
that there has been sufficient evidence of concurrent validity to 
warrant "some optimism" that indirect measures might be efficient and 
economical ways of estimating real-life proficiency, but he points out 
three major cautions. First, the indirect measures have only been 
compared with other measures which do not themselves have high face 
validity. Secondly, the result of an indirect measure might need to be 
corrected for the subject's language learning history: a written cloze 
test will not necessarily predict well the performance of a student 
who has had a purely oral approach. And thirdly, indirect measures 
will need full explanation to the users of the relation of their results to 
more obvious tests. 

Some additional sets of data have been examined over the past year, 
suggestive of the continued belief in the dictation test with added 
noise, or the noise test, as it is often called, as an effective instrument 
in the evaluation of overall language proficiency. Data gathered 
during January and February of 1974 from three quite different groups of 
subjects compare favorably with similar data previously reported on 
(Gradman, 1974). 

Perhaps the most thorough analysis of the noise test has been made 
of 20 Saudi Arabian students enrolled in a special English Skills 
Program at Indiana University. The students, all of whom began their 
coursework in January of 1974, were given the noise test, the TOEFL 
test, the Ilyin Oral Interview, and the Grabal Oral Interview. A 
multiple-choice version of the noise test was used in which students were 
asked to select from five choices the closest approximation of a 
sentence heard on tape with background distorting noise. Fifty such 
sentences were included and, in fact, were the final sentences of the 1966 
experiments. Most correlations were strong enough to suggest a 
positive relationship between performance on the noise test and the other 
instruments. The noise test, for instance, correlated at .75 with the total 
TOEFL score, the highest correlation of the noise test with any other 
test or TOEFL subtest. In fact, with the exception of the TOEFL English 
Structure and Writing subtests (.44 and .33 respectively), all 
correlations were above .60. Interestingly enough, vocabulary and noise 
correlated at .73, which was not particularly expected, nor was the .68 
correlation of the reading comprehension subtest of TOEFL and the 
noise test. The correlation of .69 between the noise test and the Ilyin 
Oral Interview (a test consisting of pictures and specific questions, 
the answers to which are recorded by the interviewer) was the 
highest of any of the Ilyin correlations. The correlation of the Ilyin Oral 
Interview with the Grabal Oral Interview (a test of free conversation 
rated on a 9 point scale for 10 categories by two independent judges), 
for instance, was only at the .59 level and with the TOEFL total score 
at the .54 level. On the other hand, the Grabal Oral Interview 
correlated somewhat similarly to the noise test. For instance, the Grabal 
and TOEFL total correlated at .73, vocabulary at .71. The writing 
section of the TOEFL correlated at a particularly low level, .17, with the 
Grabal, but this was not unexpected. Nor was the .38 correlation with 
the Reading Comprehension subtest of TOEFL. In a comparison of 
intercorrelations between parts of the TOEFL test, the Ilyin, Grabal, and 
noise tests, the only higher correlations were between the TOEFL total 
and listening comprehension subtests (.89) and the TOEFL total and 
vocabulary subtest (.85). At the very least, the noise test appeared to 
correlate better with discrete item tests (such as the TOEFL) than did 
either the Ilyin Oral Interview or the Grabal Oral Interview, both of 
which may be said to be more functionally oriented than the TOEFL 
test. By examining the set of intercorrelation data, the noise test 
appears to function fairly impressively and, in fact, to potentially bridge 
a gap left otherwise unattended to by the relatively less structured 
Ilyin and Grabal tests. This, on the other hand, should not be 
particularly surprising as the nature of the multiple-choice form of the 
noise test seems to be a cross between functional and discrete-point 
orientation, thus potentially explaining its stronger correlations with 
the TOEFL test. 

The figures do not differ much from those reported earlier (Gradman, 
1974) when 25 Saudi Arabian students were administered the 
noise test, the Grabal Oral Interview, and the TOEFL test. TOEFL and 
noise test correlations, for example, were .66 for overall performance 
and .75 for listening comprehension. The Grabal Oral Interview and 
noise test correlations were at the .79 level. 

The noise test was given to a class of Indiana University graduate 
students in language testing in February of 1974. They were first given 
the multiple-choice answer booklet (Form B) and asked to simply mark 
the correct answers. The purpose of this blind-scoring technique was 
to determine whether or not the answers were so obvious that the test 
booklet, at least, needed considerable revision. At first examination, 
the results were somewhat disheartening. Of the 33 students who took 
the test under these conditions, the mean level of performance was 
29 out of a possible 50, with a range of 30 (high of 38, low of 8), and 
even reliability (Kuder-Richardson formula 21) was .56, somewhat higher 
than we sometimes get on "real tests." 
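For readers unfamiliar with the reliability figure quoted here, the sketch below shows how Kuder-Richardson formula 21 is usually computed from nothing more than the number of items, the mean, and the variance of the total scores. It assumes the population variance is used (conventions differ), and the list of scores is hypothetical; the individual scores of the 33 students are not given in the text.

```python
from statistics import mean, pvariance


def kr21(total_scores, n_items):
    """Kuder-Richardson formula 21 reliability estimate.

    KR-21 = (k / (k - 1)) * (1 - M * (k - M) / (k * s^2)),
    where k is the number of items, M the mean total score and
    s^2 the variance of the total scores (population variance assumed).
    """
    if n_items < 2:
        raise ValueError("need at least two items")
    m = mean(total_scores)
    variance = pvariance(total_scores)
    if variance == 0:
        raise ValueError("reliability is undefined when all scores are equal")
    k = n_items
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * variance))


# Hypothetical total scores on a 50-item test (not the study's data).
hypothetical_scores = [8, 22, 27, 29, 31, 33, 38]
print(round(kr21(hypothetical_scores, n_items=50), 2))
```

Because the formula treats all items as equally difficult, it tends to give a conservative estimate, which is worth bearing in mind when a blind-scored booklet yields a figure such as .56.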

However, when the test was given again with the actual test 
sentences with added distortion, the results were quite different. The 
correlation between Form B with noise and Form B via Blind Scoring 
was only .25, a figure which seems reasonable. It suggests, in fact, that 
there is some relationship, though limited, between the ability to pick 
out grammatical responses from a list of choices and performance 
on a test with reduced redundancy. We would have been surprised 
had the results been far different. Similar results were also obtained 
when we correlated performance on the Blind Scoring of Form B with 
Form A of the noise test, in which students are asked to write what 
they heard over the tape, a straight dictation version with additional 
noise in the background. Once again the correlation was .:i:>. 

Form A of the noise test was given as a dictation exercise to 34 of 
the same group of students. Using the scoring method described in 
Spolsky et al. (1968), the top 17 scores were made by native speakers 
of English, and the bottom 17 scores were made by non-native speakers 
of English. These results were, of course, exactly as we had hoped. 
The dictation version of the noise test discriminated between native 
and non-native speakers of English. 

Form B of the noise test, the multiple-choice answer version, was 
given to the same group of students; and once again, the top 17 scores 
were made by native speakers of English and the bottom 17 scores 
were made by non-native speakers of English. As with the dictation 
version, the multiple-choice version of the noise test discriminated 
between native and non-native speakers. 

An interesting additional question, of course, was the relationship 
between performance on Form A and on Form B of the noise test. At 
first, when all scores were examined, they correlated at .80, a 
reasonably high figure. However, when we compared the performance of 
the non-native speakers alone, ignoring the minor readjustment of 
native speaker rankings, the correlation was found to be .89, a reasonably 
good indication that both Forms A and B of the noise test were 
measuring the same thing. 

When we compare the results of performance on the noise test with 
the results of that of a similar mixed group in 1973, we find them to be 
almost the same. Correlations between Form A and B were at the .8H 
level, and both forms of the noise test discriminated appropriately 
between native and non-native speakers of English (Gradman, 1974). 

The results of an examination of the performance of 71 non-native 
speakers of English who were given Form A of the noise test in 
January of 1974 and the Indiana University placement examination 
remain positive. The noise test correlated reasonably well with the 
Indiana placement examination. The test correlated at S)3 with the 
English structure subtest, with correlations progressively lower for the 
vocabulary subtest, phonology, .47, and reading comprehension, 
:r The correlation with the overall test total was .36. While there is, 
of course, an indication of relationship between the two instruments, 
there are a variety of reasons to expect these figures to be a bit lower 
than some of the others that we have seen, not the least of which is the 
somewhat different nature of the Indiana placement examination 
itself. The phonology section of the test, for instance, is a paper and 
pencil discrete item test which may or may not have anything to do 
with one's performative aural-oral skills. The reading comprehension 
section of the test is particularly difficult, extending, we believe, 
beyond the question of whether or not a student has the ability to read. 
Perhaps the two best sections of the test (the structure and vocabulary 
sections, which are somewhat contextually oriented) did indicate 
stronger correlations. 

A not unexpected result was the strong relationship between 
performance on the first forty sentences of Form A, the dictation version, 
and the last 10 sentences. It will be remembered from earlier 
discussions (Spolsky et al., 1968; Gradman, 1974) that the first 40 sentences are 
characterized by varying degrees of low signal-to-noise ratios, while 
the last 10 sentences are characterized by a high signal-to-noise ratio, 
i.e. the last 10 sentences do not appear to be accompanied by any 
distorting noise. In fact, the correlation between sentences 1-40 and 
41-50 was .93, which may lead one to believe that as an overall measure 
of language proficiency, the noise test might just as well be given as a 
dictation test without the added distorting noise. Such a correlation is, 
however, a bit deceptive in terms of the analysis of performance on 
the sentences themselves. The average percentage correct for 
sentences 1-40 differs considerably from that of sentences 41-50: 39% as 
opposed to 57%, a difference of 18%. (In a similar comparison, 
Whiteson noted a difference of 12% in her version of the test, which had a 
somewhat different marking system.) In other words, the question may 
not be one of replacement but rather of the meaning of errors on 
individual sentences with particular signal-to-noise relationships. That is, 
we remain interested in trying to determine just exactly what 
difficulties the language user incurs at particular levels of reduced 
redundancy. How much redundancy is necessary for different kinds of 
language ability, and what linguistic units relate to levels of reduced 
redundancy? The theoretical and applied potential remains for the 
testing technique, regardless of the fact that similar overall results 
might well be obtainable from dictation tests alone. 

Though we have still barely scratched the surface in terms of work
to be done on the noise test, the results thus far have been highly
encouraging. There are some very basic things right with it: the noise test
separates native and non-native speakers without fail, it correlates
reasonably well with other measures of language proficiency, and it
appears to be particularly good in its discrimination of weak and
strong non-native speakers of English. This is in a test which can be
given and marked in a minimum of time with a minimum of difficulty.

REFERENCES 

Brière, Eugene J. "Current Trends in Second Language Testing." TESOL
Quarterly 3:4 (December 1969), 333-40.

Clark, John. "Psychometric Perspectives in Language Testing." To appear
in Spolsky, Bernard (ed.), Current Trends in Language Testing. The
Hague: Mouton, forthcoming.

Craker, Hazel V. Clozentropy Procedure as an Instrument for Measuring
Oral English Competencies of First Grade Children. Unpublished Ed.D.
dissertation, University of New Mexico, 1971.

Darnell, Donald K. "Clozentropy: A Procedure for Testing English
Language Proficiency of Foreign Students." Speech Monographs 37:1
(March 1970), 36-46.

Gradman, Harry L. "Fundamental Considerations in the Evaluation of
Foreign Language Proficiency." (Paper presented at the International
Seminar on Language Testing, jointly sponsored by TESOL and the AILA
Commission on Language Tests and Testing, May 11, 1973, San Juan,
Puerto Rico.)

-----. "Reduced Redundancy Testing: A Reconsideration." In O'Brien,
M. E. Concannon (ed.), Second Language Testing: New Dimensions.
Dublin: Dublin University Press, 1974.

Ilyin, Donna. Ilyin Oral Interview. (Experimental edition.) Rowley,
Mass.: Newbury House, 1972.

Johansson, Stig. "An Evaluation of the Noise Test: A Method for Testing
Overall Second Language Proficiency by Perception Under Masking
Noise." IRAL 11:2 (May 1973), 107-133.

Jones, Randall. "The FSI Interview." To appear in Spolsky, Bernard
(ed.), Current Trends in Language Testing. The Hague: Mouton,
forthcoming.

Oller, John W., Jr. "Dictation as a Device for Testing Foreign Language
Proficiency." English Language Teaching 25:3 (June 1971), 254-259.

-----. "Cloze Tests of Second Language Proficiency and What They
Measure." Language Learning 23:1 (June 1973), 105-118.

Spolsky, Bernard. "Reduced Redundancy as a Language Testing Tool." In
Perren, G. E. and Trim, J. L. M. (eds.), Applications of Linguistics:
Selected Papers of the Second International Congress of Applied
Linguistics, Cambridge 1969. London: Cambridge University Press, 1971,
383-390.

-----, Bengt Sigurd, Masahito Sato, Edward Walker and Catherine
Arterburn. "Preliminary Studies in the Development of Techniques for
Testing Overall Second Language Proficiency." Language Learning 18,
Special Issue No. 3 (August 1968), 79-101.

-----, Penny Murphy, Wayne Holm and Allen Ferrel. "Three Functional
Tests of Oral Proficiency." TESOL Quarterly 6:3 (September 1972),
221-236.

Whiteson, Valerie. "The Correlation of Auditory Comprehension with
General Language Proficiency." Audio-Visual Language Journal 10:2
(Summer 1972), 89-91.

DISCUSSION 

Tetrault: Could you comment on correlations with direct measures?
Gradman: You may recall what I mentioned about the Grabal oral interview,
which was in fact simply an oral interview test. The noise test correlated at
04 with that particular measurement, which we thought was a fairly strong
correlation. That is as direct a measure as we have. The Ilyin oral interview,
which some people are a little negative about, with pictures and particular
sentences that you have to ask questions about, showed a little higher
correlation, .69. But this test, as I mentioned, seemed to bridge a gap between
direct and other indirect measures.

Clark: I believe you said you had the highest correlations between the noise
test and the TOEFL. This might be explained by the fact that the TOEFL itself
has high internal reliability, and it may well be that if you were to correct the
criterion for unreliability in the Ilyin oral interview and other direct tests,
you would get even more favorable correlations than are indicated here.
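
Clark's remark about correcting the criterion for unreliability corresponds to the standard correction for attenuation, in which an observed correlation is divided by the square root of the criterion's reliability. The following is a minimal Python sketch of that arithmetic only; the function name is an illustrative assumption, and no reliability figures for the interview tests were reported in the discussion.

```python
import math


def correct_for_criterion_unreliability(r_observed, criterion_reliability):
    """Estimate the correlation against a perfectly reliable criterion.

    r_observed: observed correlation between the test and the criterion.
    criterion_reliability: reliability coefficient of the criterion, in (0, 1].
    Both arguments are supplied by the caller; nothing here comes from the
    symposium data.
    """
    if not 0.0 < criterion_reliability <= 1.0:
        raise ValueError("criterion reliability must be in (0, 1]")
    return r_observed / math.sqrt(criterion_reliability)
```

With purely hypothetical inputs, an observed correlation of .60 against a criterion whose reliability is .64 would be adjusted to .75, which is the direction of change Clark anticipates.
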
Lado: How was the test scored?

Gradman: We scored five points in the dictation version if everything was
correct. We ignored spelling and punctuation. Four points for one error.
Anything more than one error, all the way down to simply one word right, was
one point. Nothing right was zero. In other words, we used 5, 4, 1, and 0. But
the correlations between this and the multiple-choice version, where we
simply gave one point if it was picked correctly from five alternatives, were
quite high. We haven't compared it with Johansson's system, which is a bit
different. I think his was 3, 2, 1.
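
The marking scheme Gradman describes for the dictation version can be written out as a short rule. The Python sketch below is only an illustration of that rule as stated in the discussion; the function and parameter names are assumptions, and it presumes that errors and correct words have already been counted with spelling and punctuation ignored.

```python
def score_dictated_sentence(words_right, errors):
    """Score one dictated sentence on the 5 / 4 / 1 / 0 scale.

    words_right: number of words the examinee reproduced correctly.
    errors: number of errors, counted with spelling and punctuation ignored.
    """
    if words_right == 0:
        return 0  # nothing right
    if errors == 0:
        return 5  # everything correct
    if errors == 1:
        return 4  # exactly one error
    return 1      # more than one error, but at least one word right


def score_multiple_choice_item(chosen, correct):
    """One point if the correct alternative was picked, otherwise zero."""
    return 1 if chosen == correct else 0
```
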

Lado: We all seem to have accepted the idea that looking at a picture and
talking about it is an indirect technique. I don't think it's indirect at all.
Spolsky: I'd like to take up that question of what an indirect or direct
technique is. It's possible to think up real-life contexts in which something like
the noise test occurs, in other words, listening to an announcement in an
airport, or trying to hear an item on the news when the radio is fuzzy. So one
can, in fact, say that even this indirect measure can be considered a direct
measure of a very specific functional activity. The question then becomes
how widely a single kind of measure like this will correlate with all the
others. What interested us initially was the notion of overall proficiency,
which we thought was something that should correlate with general language
knowledge. We added the noise in hopes of getting some agreement with
information theory's models of being able to actually add redundancy in a
technically measurable way. In this way you can say that the testee's
knowledge of the language is equivalent to adding so much redundancy, or even
carrying it through to questions of intelligibility, and that this accent is an
intelligible equivalent to the following kind of noise.
Jones: What's your definition of overall proficiency?

Spolsky: It's something that presumably has what Alan Davies would call
construct validity. In other words, it depends on a theoretical notion of
knowledge of a language and the assumption that while this knowledge at a
certain level can be divided up into various kinds of skills, there is
something underlying the various skills which is obviously not the same as
competence. You have to allow, of course, for gross differences. For example, if
somebody is deaf he won't be very good at listening, if somebody hasn't
learned to read or write he won't be good at reading or writing, and if
somebody has never been exposed to speech of a certain variety he won't be
good at handling that. And after allowing for those gross, very specific
differences of experience, whatever is left is overall proficiency.
Anon.: What is reduced redundancy?

Gradman: Presumably language is redundant, that is, there are a variety of
clues in a sentence. By adding noise to the background, it's possible that some
of the structural features, at least, may be obscured, but the message may still
come through. As a matter of fact, the test shows the point at which native
speakers can operate with less of the message than non-native speakers need.
Presumably that means that language is redundant enough so that, when
only part of the message comes through, it can still be interpreted by a
native speaker but not by a non-native speaker. It's kind of the experience
you get sometimes when you listen to the radio and there's static in the
background, but you can still hear the message. A lot of people complain about
having to talk to non-native speakers over the telephone, because the phone
itself is just an acoustical device and they can't understand them nearly as
well as they can face-to-face.

Cartier: In the 1940s there was a considerable amount of research done by
Bell Telephone Laboratories and other people on the redundancy in the
sound signal, in the acoustic signal of speech. One of the things they did, for
example, was to take tape recordings and go through and clip out little
chunks. The indications were then that the acoustic signal contains twice as
much acoustic information as is necessary for a native speaker of the
language to understand a telephone message. There are other ways that
language is redundant besides acoustically. We use an s ending for verbs when
the subject is he, for example, though the he itself indicates that that's third
person, making the s on the end of the verb redundant. One way to reduce
the redundancy, then, would be to knock off that morpheme. There are many
ways you can reduce the redundancy in the language, and still have it
intelligible to native speakers. And what Spolsky is trying to do is experiment with
various kinds of reduction of that redundancy to see what it does in the
testing situation.

Davies: I'd like to ask whether the experiments with reduced redundancy
have concentrated on the facts of the message, or whether you're also taking
into account the attitudes of communication, whether it's the total
communication or just the bones of the message?

Spolsky: Most of the work with the noise test has been done with single
sentences, and with simply the ability to recognize those sentences or to write
them down. Until one moves into larger contexts, which I understand is
planned, it would be impossible to get into any of these other aspects.
Risen: Earlier someone suggested just introducing noise on every tenth word,
and I wondered if that might not be introducing more variables than it
controls. I'm thinking about some studies that were done with introducing clicks,
where it was found that, if the clicks occurred near a syntactic boundary, it
introduced less interference than otherwise.

Spolsky: Presumably, if you do this in a statistical way, randomly, with
these noises appearing in a statistical rather than in a linguistic pattern, you'll
overcome the effect of that phenomenon if it does work the same way as in a
cloze test. You can do it where you take out certain parts of speech, but that's
a very different kind of cloze test from one where you take out every fifth or
sixth word, and certain of these words that get taken out happen to be harder
than other words for very good reasons. As long as you're adding the thing
randomly in a statistical way, you're breaking across any of these linguistic
principles or averaging them out.

Garcia-Zamor: I'd like to address my question to the person who said earlier,
"I believe in overall proficiency." I wanted to ask you precisely in which way
you see that overall proficiency might differ from the sum or average of one's
competence in the different aspects of language that you might be able to
isolate? Unless it's significantly different from that, I don't see any meaning
in the term "overall proficiency."

Spolsky: It should be obvious by now that I can't say that precisely, or I
would have. It's an idea that I'm still playing with. It has to correlate with the
sum of various kinds of things in some way, because it should underlie any
specific abilities. In other words, I have the notion that ability to operate in a
language includes a good, solid central portion (which I'll call overall
proficiency) plus a number of specific areas based on experience and which will
turn out to be either the skill or certain sociolinguistic situations. Given a
picture like that, one can understand why there are such good correlations
between almost any kind of language test and any other kind of language test.
Why, in fact, one is surprised at not finding correlations. I'm told that of all
the tests that ETS has, the ones in which they get the highest internal
reliabilities are language tests. Theoretically, at least, two people could know very
different parts of a language and, having a fairly small part in common, still
know how to get by. That's where overall proficiency becomes important.
Clark: I basically agree with that. But then we come back to the question of
what the specific learning history of the student is, and I could see a situation
in which the teacher wouldn't say a word in the foreign language during the
entire course but would show printed materials with English equivalents, for
example. Then if a listening comprehension test were to be given at the end
of that particular course, I don't think we would have the general proficiency
you're talking about.

Spolsky: The question is, "How do you capture overall proficiency?" Taking
the two kinds of measures that theoretically are closest to it, the dictation
with or without noise and the cloze test (which for good theoretical reasons
are both cases of reduction of redundancy), it's quite obvious that a student
who has never learned to read won't do anything very intelligible with the
cloze test. And the same is obvious with a student who has never heard the
language spoken; he won't do anything intelligent with the noise test. But
excluding these extreme cases, you would assume that there is a fairly large
group with minimal knowledge of each that will show up well in the middle.
Stevick: I wonder if there is anything relevant from the Peace Corps
experience, where we had fairly large numbers of people coming in who had
studied French or Spanish, who on initial testing turned out to be 0 or 0+,
apparently not much better than an absolute beginner, but who, when
exposed to the spoken language, bloomed rather rapidly? That may be another
example of the same thing.

Spolsky: That would be equivalent to a situation in which someone is
exposed to the traditional method of learning a language, that is, a
grammar-translation approach at school, and then goes to live in the country for two
months. At the beginning of the two months that person would test out
completely at 0 or something on any kind of oral test. But he already has this
overall proficiency that is just waiting for new experiences.
Rolff: Mr. Gradman, you mentioned five types of sentences, but could you
mention specifically what types of sentences, and why you chose to use them
in the reduced redundancy test?

Gradman: Those were actually Spolsky's sentences back in 1966. The initial
study, by the way, is reported in Special Issue Number 3 of Language
Learning, 1968. There were simple negatives, simple negative questions, simple
questions, simple passives, a category called embedded, embedded negatives,
embedded questions, embedded questions signaled by intonation only,
embedded negative questions, and a category called miscellaneous.
Spolsky: Those with memories that go back to 1965-66 will remember that in
those days we were talking of models of grammar that assumed that sentence
difficulty could be described by the number and kind of transformations.
Rashbaum: I was very curious about the type of noise that was used to distort
the speech, and I was wondering whether actual distortion by varying the
pitch or other things had been considered in reduced redundancy?
Spolsky: We tried a number of different kinds of noise at one stage. We
found that, for the person taking the test, the most difficult of these was, in
fact, background conversation, especially when it was in the subject's native
language. But then we decided to use white noise, which seemed to have all
the sort of basic characteristics to do the job. Somebody else suggested pink
noise. I'm not sure of the difference. I'm told that it might have been better
for this sort of thing.
Anon.: What is white noise? 

Cartier: White noise sounds like this: sh/sh/sh/sh/sh. It's simply random
frequencies at random amplitudes, the basic kind of noise that you hear in
back of radio broadcasts. It's called white because it has the same
characteristics as white light, that is, all frequencies are represented at random. I
guess pink noise is just a little more regular in frequency.
Rickerson: I think it's demonstrable that reduced redundancy testing will, in
fact, distinguish native speakers from non-native speakers. Could you
comment further on the applicability of that type of testing, though, to
establishing the gradations of 1, 2, 3, 4, 5 in proficiency? It would seem rather difficult
to do.

Gradman: We found it performs fairly well in terms of separating out the
very good and the very bad. We have trouble in the middle.





Dictation: A Test of Grammar Based Expectancies 

John W. Oller, Jr. and Virginia Streiff*



I. DICTATION REVISITED 

Since the publication of "Dictation as a Device for Testing Foreign
Language Proficiency" in English Language Teaching (henceforth
referred to as the 1971 paper), the utility of dictation for testing has
been demonstrated repeatedly. It is an excellent measure of overall
language proficiency (Johansson 1974; Oller 1972a, 1972b) and has
proved useful as an elicitation technique for diagnostic data (Angelis
1974). Although some of the discussion concerning the validity of
dictation has been skeptical (Rand 1972; Breitenstein 1972), careful
research increasingly supports confidence in the technique.

The purpose of this paper is to present a re-evaluation of the 1971
paper. Those data showed that the Dictation scores on the UCLA English
as a Second Language Placement Examination (UCLA ESLPE 1)
correlated more highly with Total test scores and with other Part scores
than did any other Part of the ESLPE. The re-evaluation was prompted
by useful critiques (Rand 1972; Breitenstein 1972). An error in
the computation of correlations between Part (subtest) scores and
Total scores in that analysis is corrected; additional information
concerning test rationale, administration, scoring, and interpretation is
provided; and finally, a more comprehensive theoretical explanation
is offered to account for the utility of dictation as a measure of
language proficiency.

In a Reader's Letter, Breitenstein (1972) commented that many
factors which enter into the process of giving and taking dictation
were not mentioned in the 1971 paper. For example, there is "the
eyesight of the reader" (or the "dictator" as Breitenstein terms him),
the condition of his eye glasses (which "may be dirty or due for
renewal"), "the speaker's diction" (possibly affected by "speech
defects or an ill-fitting denture"), "the size of the room," "the acoustics
of the room," or the hearing acuity of the examinees, etc. The
hyperbole of Breitenstein's facetious commentary reaches its asymptote
when he observes that "Oller's statement that 'dictation tests a broad
range of integrative skills' is now taking on a wider meaning than
he probably meant."

*We wish to thank Professor Lois McIntosh (UCLA) for providing us with a detailed
description of the test given in the fall of 1968. It is actually Professor McIntosh whose
teaching skill and experience supported confidence in dictation that is at base
responsible for not only this paper but a number of others on the topic. We gratefully
acknowledge our indebtedness to her. Without her insight into the testing of language
skills, the facts discussed here, which were originally uncovered more or less by
accident in a routine analysis, might have gone unnoticed for another 20 years of
discrete-point testing.

Quite apart from the humor in Breitenstein's remarks, there is an
implied serious criticism that merits attention. The earlier paper did
not mention some important facts about how the dictation was
selected, administered, scored, and interpreted. We discuss these
questions below.

Rand's critique (1972) suggests a re-evaluation of the statistical data
reported in the 1971 paper. Rand correctly observes that the
intercorrelations between Part scores and the Total score on the UCLA
ESLPE 1 were influenced by the weighting of the Part scores. (See the
discussion of the test Parts and their weighting below.) In order to
achieve a more accurate picture of the intercorrelations, it is
necessary to adjust the weightings of the Part scores so that an equal
number of points are allowed on each subsection of the test, or
alternatively to systematically eliminate the Part scores from the Total score
for purposes of correlation.

II. RE-EVALUATION OF DATA DISCUSSED IN THE 1971 PAPER

We will present the re-evaluation of the data from the 1971 paper in
three parts: (1) a more complete description of the tested population
and the rationale behind the test (in response to Breitenstein 1972),
(2) a more complete description of the test, and (3) a new look at the
Part and Total score correlations (in response to Rand 1972).

Population and Test Rationale 

The UCLA ESLPE 1 was administered to about 350 students in the fall
of 1968. A sample of 102 students was selected. They were
representative of about 50 different language backgrounds. About 70 percent of
them were males, and 30 percent females. Approximately 60 percent
of the students were graduates, while the remainder were
undergraduates with regular or part-time status. (See Oller 1972c for a
description of a similar population tested in the fall of 1970.)

The objective of the test is to measure English language proficiency
for placement purposes. Students who have near native speaker
proficiency are exempted from ESL courses and are allowed to enroll in
a full course load in their regular studies. Those students who have
difficulties with English are required to take one or more courses in
remedial English and may be limited to a smaller course load in their
regular course of study.







Prior to 1969 when the research reported in the 1971 paper was
carried out, the UCLA ESLPE 1 had never been subjected to the close
empirical scrutiny of any statistical analysis. It had been assumed
earlier that Part I measured skills closely associated with reading
comprehension, Part II indicated how well students could handle
English structure, Part III was a good measure of essay writing ability,
Part IV tested discrimination skills in the area of sounds, and Part V
was a good measure of spelling and listening comprehension. The
extent of overlap between the various Parts, and the meaning of the
Total score, were actually unknown. The intent of the test was to
provide a reliable and valid estimate of overall skill in English along
with diagnostic information concerning possible areas of specific
weakness.

It would not be difficult to formulate criticisms of the test as a
whole and its particular subsections independent of any statistical
analysis. This is not the concern of this paper, however. What we are
interested in are answers to the following questions. Given the several
parts of the UCLA ESLPE 1, what was the amount of overlap between
them? Was there one subtest that provided more information than the
rest? Should any one or more subtests have been replaced or done
away with? These are some of the concerns that prompted the
analysis presented in the 1971 paper and which, together with the
observations stated earlier in this paper, motivated the computations reported
here.

Description of the Test: UCLA ESLPE 1

The UCLA ESLPE 1 consists of five parts. Part I, a Vocabulary Test of
20 items, requires the student to match a word in a story-like context
with a synonym. For example:

    But the frontier fostered positive traits too. . . .

    FOSTERED
    (A) discouraged
    (B) promoted
    (C) adopted

The student reads the context and then selects from (A), (B), or (C)
the one that most nearly matches the meaning of the stem word
FOSTERED.

Part II is a Grammar Test of 50 items. Each item asks the student to
select the most acceptable sentence from three choices. For instance:

    (A) The boy's parents let him to play in the water.
    (B) The boy's parents let him play in the water.
    (C) The boy's parents let him playing in the water.

Part III is a Composition. Students were instructed:






Write a composition of 200 words, discussing ONE of the following
topics. Your ideas should be clear and well organized. When
you have finished, examine your paper carefully to be sure that
your grammar, spelling and punctuation are correct. Then count
the number of words. PLACE A LARGE X after the two
hundredth word (200). If you have written fewer than 200 words give
the exact number at the end of your composition. Choose ONE
and ONLY ONE of the following topics:

1. An interesting place to visit in my country.

2. Advances in human relations in our time.

3. A problem not yet solved by science.

4. The most popular sport in my country.

Part IV, Phonology, tests perception of English sounds. It consists 
of 30 tape recorded items. The student hears a sentence on tape. The 
sentence contains one of two words that are similar phonologically,
e.g. long and wrong as in "His answer was (A) long (B) wrong." The
student has a written form of the sentence on the test paper and must
decide which of the two words was on the tape.

Part V is a Dictation. The Dictation is actually in two sections. The
two passages selected are each about 100 words in length. One is on a
topic of general interest; the other has a science-oriented focus. The
material selected for the Dictation is language of a type college-level 
students are expected to encounter in their course of study. The stu- 
dent is given the following instructions in writing and on tape: 

The purpose of this dictation exercise is to test your aural
comprehension and spelling of English. First, listen as the instructor
reads the selection at a normal rate. Then proceed to write as the
instructor begins to read the selection a second time sentence by
sentence. Correct your work when he reads each sentence a
third time. The instructor will tell you when to punctuate.

The student then hears the dictation on tape. The text for the UCLA 
ESLPE I follows: 

(1)

There are many lessons which a new student has to learn when
he first comes to a large university. Among other things he must
adjust himself to the new environment; he must learn to be
independent and wise in managing his affairs; he must learn to get
along with many people. Above all, he should recognize with
humility that there is much to be learned and that his main job is
to grow in intellect and in spirit. But he mustn't lose sight of the
fact that education, like life, is most worthwhile when it is
enjoyed.

(2)

In scientific inquiry, it becomes a matter of duty to expose a
supposed law to every kind of verification, and to take care,
moreover, that it is done intentionally. For instance, if you drop
something, it will immediately fall to the ground. That is a very
common verification of one of the best established laws of
nature, the law of gravitation. We believe it in such an extensive,
thorough, and unhesitating manner because the universal
experience of mankind verifies it. And that is the strongest foundation
on which any natural law can rest.

The scoring of Parts I, II, and IV, all of which were multiple-choice
questions, was purely objective. Each item in Part I was worth 1 point,
the whole section being worth 20 points. Items in Part II were each
worth ½ point, making the whole section worth 25 points. Part IV was
worth 15 points, with each item valued at ½ point.

Parts III and V require more explanation. Part III was worth a total
of 25 points with each error subtracting ½ point. Students who made
more than 50 errors (with a maximum of 1 error per word attempted)
were given a score of 0. There were no negative scores, i.e. if a
student made 50 errors or more, he scored 0. Spelling errors were
counted along with errors in word order, grammatical form, choice of
words, and the like. If the student wrote less than 200 words, his
errors were pro-rated on the basis of the following formula:

    (Number of words written by the student) / (200 words)
        = (Number of errors made by the student) / X

The variable X is the pro-rated number of errors, so the student's
pro-rated score would be 25 - ½X. For example, if he wrote 100 words
and made 10 errors, by the formula X = 20, and his score would be
25 - ½(20) = 15 points. The scoring of Part III involved a considerable
amount of subjective judgment and was probably less reliable than
the scoring of any of the other sections.
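
The pro-rating rule for the Composition can be checked with a few lines of Python. This is a minimal sketch of the arithmetic described above rather than the scoring procedure actually used at UCLA; the function and parameter names are illustrative, and the treatment of an empty paper is an assumption, since the text does not cover that case.

```python
def composition_score(words_written, errors,
                      max_points=25.0, penalty=0.5, target_words=200):
    """Score a composition, pro-rating errors for papers under 200 words.

    words_written / target_words = errors / X, so X = errors * target_words
    / words_written. The score is max_points - penalty * X, floored at 0.
    """
    if words_written <= 0:
        return 0.0  # assumption: an empty paper earns no points
    if words_written < target_words:
        prorated_errors = errors * target_words / words_written
    else:
        prorated_errors = float(errors)
    return max(0.0, max_points - penalty * prorated_errors)
```

For the worked example in the text, composition_score(100, 10) gives X = 20 and a score of 15 points.
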

A maximum of 15 points was allowed for the Dictation. Clear errors
in spelling (e.g. shagrin for chagrin), phonology (e.g. long hair for
lawn care), grammar (e.g. it became for it becomes), or choice of
wording (e.g. humanity for mankind) counted as ¼ point subtracted
from the maximum possible score of 15 points. A maximum of ¼
point could be subtracted for multiple errors in a single word, e.g. an
extra word inserted into the text which was ungrammatical,
misspelled, and out of order would count as only one error. If the student
made 60 errors or more on the Dictation, a score of 0 was recorded.
Alternative methods of scoring are suggested by Valette (1967).
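
The rule that several errors in a single word count only once can be made explicit in code. The sketch below assumes a quarter-point penalty per erroneous word, which is the value consistent with a 15-point maximum and a score of 0 at 60 errors; the input format (one list of error labels per word) and the names are hypothetical.

```python
def dictation_score(errors_per_word, max_points=15.0, penalty=0.25):
    """Score a dictation from a per-word list of error labels.

    errors_per_word: one list per word written, naming the errors found in
    that word (an empty list means the word is correct). A word with several
    errors is penalised once, and the score never drops below zero.
    """
    words_with_errors = sum(1 for errs in errors_per_word if errs)
    return max(0.0, max_points - penalty * words_with_errors)
```

A paper with one misspelled word and one inserted word that is ungrammatical, misspelled, and out of order therefore loses two quarter points, not four.
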

Part and Total Intercorrelations on the UCLA ESLPE 1 
The surprising finding in the 1971 paper was that the Dictation
correlated better with each other Part of the UCLA ESLPE 1 than did
any other Part. Also, Dictation correlated at .86 with the Total score,
which was only slightly less than the correlation of .88 between the
Total and the Composition score. What these data suggested was that
the Dictation was providing more information concerning the totality
of skills being measured than any other Part of the test. In fact, it
seemed to be tapping an underlying competence in English.

The data presented in the 1971 paper, however, have been
questioned by Rand (1972). As mentioned earlier, Rand (1972) correctly
observes that the weightings of Part scores will affect their correlation
with the Total score. Obviously, there is perfect correlation between
the portion of the Total score and the Part score to which it
corresponds. Also, differential weightings of scores will have slight effects
on Part and Total correlations even if the self-correlations are
systematically eliminated. If Part scores are unevenly weighted (which
they were in the 1971 paper), the intercorrelations between Part
scores and the Total will be misleading.

One way of removing the error is to adjust the weightings of the 
Part scores so that each part is worth an equal number of points 
toward the Total. Table I presents the results of a re-analysis of the 
data on just such a basis (see Appendix). For convenience of 
comparison the correlation data from the 1971 paper is reproduced as 
Table II (see Appendix). Table II was actually based on 102 subjects, 
rather than 100, as was incorrectly reported in the earlier paper. Two 
errors in the data deck discovered in the re-analysis and corrected 
in Table I are not corrected for Table II. It is reproduced exactly as 
it was originally presented in the 1971 paper. 
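
The re-weighting step can be illustrated with a short sketch. It assumes the original maxima shown in the Appendix (20, 25, 25, 15, and 15 points) and rescales every Part to 25 points before correlating each Part with the Total. The five examinees below are invented purely for illustration and are not data from the study; the function names are likewise hypothetical.

```python
from math import sqrt

# Original maximum points per Part (from the Appendix table of 1971 weightings).
ORIGINAL_MAX = {"Vocabulary": 20, "Grammar": 25, "Composition": 25,
                "Phonology": 15, "Dictation": 15}

# Invented scores for five examinees, for illustration only.
HYPOTHETICAL_SCORES = {
    "Vocabulary":  [18, 12, 15, 9, 20],
    "Grammar":     [22, 15, 18, 10, 24],
    "Composition": [20, 14, 19, 11, 23],
    "Phonology":   [12, 9, 10, 8, 14],
    "Dictation":   [13, 7, 10, 4, 15],
}


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def equalize(scores, target=25):
    """Rescale each Part so that its maximum possible score equals `target`."""
    return {part: [v * target / ORIGINAL_MAX[part] for v in values]
            for part, values in scores.items()}


def part_total_correlations(scores):
    """Correlate every Part with the Total (the sum of all Parts)."""
    n = len(next(iter(scores.values())))
    total = [sum(scores[part][i] for part in scores) for i in range(n)]
    return {part: pearson(values, total) for part, values in scores.items()}


equalized = equalize(HYPOTHETICAL_SCORES)
for part, r in part_total_correlations(equalized).items():
    print(f"{part:12s} r = {r:.2f}")
```

Note that each Part still contributes to the Total it is correlated with, so these coefficients include the self-correlation discussed above.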

It is noteworthy that the re-analysis (see Table I) shows a .94 
correlation between the adjusted Dictation score and adjusted Total, 
while the correlation between Composition and Total is reduced from 
.88 (Table II) to .83 (Table I). Corrections of the two errors detected 
in the data cards account for the slight discrepancies in 
intercorrelations between the Parts in Tables I and II. 

The data indicate that the Dictation by itself could validly be 
substituted for the Total (where the Total is computed by adding the 
equally weighted scores on Vocabulary, Grammar, Composition, 
Phonology, and Dictation). 

Table III (see Appendix) presents correlations with the Total scores, 
eliminating self-correlations of Parts in a step-wise fashion. In other 
words, each Part is correlated with the Total computed by the sum of 
scores on the remaining Parts. For example, Dictation is correlated 
with the sum of Vocabulary, Grammar, Composition, and Phonology. 
Here again we see clearly the superior performance of Dictation as a 
measure of the composite of skills being tested. 
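
Removing the self-correlation amounts to correlating each Part with the sum of the other Parts only. The sketch below shows one way to compute such part-remainder correlations; the function names and the tiny data set are hypothetical, and the Part scores are assumed to be already equally weighted.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def remainder_correlations(scores):
    """Correlate each Part with the sum of all the *other* Parts."""
    parts = list(scores)
    n = len(scores[parts[0]])
    result = {}
    for part in parts:
        # The Total for this comparison leaves out the Part itself.
        remainder = [sum(scores[other][i] for other in parts if other != part)
                     for i in range(n)]
        result[part] = pearson(scores[part], remainder)
    return result


# Invented, equally weighted scores for four examinees (illustration only).
example = {
    "Vocabulary": [20, 14, 18, 9],
    "Grammar":    [22, 13, 17, 11],
    "Dictation":  [24, 10, 19, 6],
}
for part, r in remainder_correlations(example).items():
    print(f"{part:10s} vs. remaining Parts: r = {r:.2f}")
```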




Together with the earlier research of Valette (1964, 1967), the 
follow-up research of Johansson (1974), and Oller (1972a, 1972b, 
1972c), the foregoing constitutes a clear refutation of the claims by 
language testing experts that dictation is not a good language test (cf. 
Harris 1969, Lado 1961, Somaratne 1957; Anderson 1953 as cited in the 
1971 paper but not in the references to this paper). 

Moreover, the high correlations achieved repeatedly between 
dictation and other integrative tests such as the cloze procedure (see 
Oller 1972b, 1972c) support a psycholinguistic basis contrary to much 
recent theorizing (see TOEFL: Interpretive Manual, 1970) for 
interpreting intercorrelations of tests of language proficiency. When 
intercorrelations between diverse tests are near or above the .90 level, 
a psycholinguistic model leads us to infer high test validity for both 
tests. In a cloze test, for example, material is presented visually, 
whereas in dictation, it is presented auditorily. When such vastly 
different tests consistently intercorrelate at the .85 level or better (cf. 
Oller 1972c, and references), we may reasonably conclude that they are 
tapping an underlying competence. Since we can assume on the grounds 
of independent psycholinguistic research that such an underlying 
competence exists, we may without danger of circular reasoning argue 
that the two tests cross-validate each other. Obviously this will lead 
us to expect high intercorrelations between valid language tests of all 
sorts. Low intercorrelations must be interpreted as indicating low test 
validity, i.e. that one of the tests being correlated does not tap 
underlying linguistic competence or that it does so to an insufficient 
extent. 

III. HOW DOES DICTATION MEASURE LANGUAGE COMPETENCE? 

The complexity of taking dictation is greater than might have been 
suspected before the advent of "constructivist" models of speech 
perception and information processing (Neisser 1967; Chomsky and 
Halle 1968; Cooper 1972; Stevens and House 1972; Liberman et al. 
1967). The claim underlying these psycholinguistic models is that 
comprehension of speech, like other perceptual activities, requires 
active analysis-by-synthesis. "All of these models for perception ... 
have in common a listener who actively participates in producing 
speech as well as in listening to it in order that he may compare ... 
[his synthesis] with the incoming [sequence]. It may be that the 
comparators are the functional component of central interest. ..."³ We 
suggest that the comparator is no more nor less than a grammar of 
expectancy. It seems that the perceiver formulates expectancies (or 
hypotheses) concerning the sound stream based on his internalized 
grammar of the language.⁴ We refer to this process in the title of the 
paper where we suggest that dictation is a device which measures the 
efficiency of grammar-based expectancies. 






Neisser (1967) posits a two-stage model of cognitive processing of 
speech input and other sorts of cognitive information as well. In the 
case of speech perception, the listener first formulates a kind of 
synthesis that is "fast, crude, wholistic, and parallel"; the second 
stage of perception is a "deliberate, attentive, detailed, and 
sequential" analysis. We may apply this model to the writing of a 
dictation, providing that we remember there must be a rapid-fire 
alternation between synthetic and analytic processes. We may assume 
that a non-native speaker forms a "fast, crude ..." notion of what is 
being talked about (i.e. meaning) and then analyzes in a "deliberate, 
attentive ... sequential" fashion in order to write down the segmented 
and classified sequences that he has heard. As Chomsky and Halle (1968) 
suggest in another context, "the hypothesis [or "synthesis based on 
grammar generated expectancies," in our terms] will then be accepted 
if it is not too radically at variance with the acoustic material."⁵ 
Of course, if the student's (or listener's) grammar of expectancy is 
incomplete, the kinds of hypotheses that he will accept will deviate 
substantially from the actual sequences of elements in the dictation. 
When students convert a phrase like "scientists from many nations" 
into "scientist's imaginations" and "scientist's examinations," an 
active analysis-by-synthesis is clearly apparent. On a dictation given 
at UCLA not long ago, one student converted an entire paragraph on 
"brain cells" into a fairly readable and phonetically similar 
paragraph on "brand sales." It would be absurd to suggest that the 
process of analysis-by-synthesis is only taking place when students 
make errors. It is the process underlying their listening behavior 
in general and is only more obvious in creative errors. 

Since dictation activates the learner's internalized grammar of 
expectancy, which we assume is the central component of his language 
competence, it is not surprising that a dictation test yields 
substantial information concerning his overall proficiency in the 
language; indeed, more information than some other tests that have 
been blessed with greater approval by the "experts" (see discussion 
in the 1971 paper). As a testing device it "yields useful information on 
errors at all levels" (Angelis 1974) and meets rigorous standards of 
validity (Johansson 1974). It seems likely to be a useful instrument 
for testing short-term instructional goals as well as integrated 
language achievement over the long term. There are many experimental 
and practical uses which remain to be explored. 



NOTES 

1. The paper referred to originally appeared first in UCLA TESL Workpapers 1 (1970), 
37-41. It was published subsequently in English Language Teaching 25:3 (June 1971), 
254-9, and in a revised and expanded form in H. B. Allen and R. N. Campbell, eds., 
Teaching English as a Second Language: A Book of Readings. New York: McGraw 
Hill, 1972, pp. 346-54. 

2. On the other hand, Breitenstein's remarks also indicate two serious 
misunderstandings. The first concerns the use of dictation as a test. Breitenstein 
suggests, "let us not forget that in our mother tongue we can fill in gaps in what we 
hear up to ten times better than in the case of a foreign language we have not yet 
mastered" (p. 203). Ignoring the trivial matter of Breitenstein's arithmetic and its 
questionable empirical basis, his observation does not point up a disadvantage of 
dictation as a testing device but rather a crucial advantage. It is largely the 
disparity between our ability to "fill in gaps in our mother tongue" and in a 
"foreign language" that a dictation test serves to reveal. 

The second misunderstanding in Breitenstein's letter concerns student errors. 
He says, "the mistakes are there, but are they due to the dictator, the acoustics of 
the room, the hearing of the candidate, or his knowledge?" (p. 203). Admittedly, bad 
room acoustics or weak hearing may result in errors unique to a particular student, 
but difficulties generated by the person giving the dictation will show up in the 
performance of many if not all of the examinees and, contrary to what Breitenstein 
implies, it is possible to identify such errors. Moreover, the purpose of the particular 
dictation Breitenstein was discussing was to measure the listening comprehension of 
college-level, non-native speakers of English under simulated classroom listening 
conditions. To attempt perfect control of acoustic conditions and hearing acuity 
would not be realistic. An important aspect of the ability to understand spoken 
English is being able to do it under the constraints and difficulties afforded by a 
normal classroom situation. 

3. Cooper 1972, p. 42. 

4. Throughout this paper we assume a pragmatic definition of grammar as discussed 
by Oller (1970, 1973a) and Oller and Richards (1973). The main distinction between this 
sort of definition of grammar and the early Chomskyan paradigm is our claim that 
one must include semantic and pragmatic facts in the grammar. Also see Oller 
(1973b). Later Chomskyan theory has begun to take steps to correct the earlier 
inadequacy (Chomsky 1972). 

5. As cited by Cooper 1972, p. 41. 

REFERENCES 

Allen, H. B. and R. N. Campbell, eds. (1972). Teaching English as a Second Language: 
A Book of Readings. New York: McGraw Hill. 

Angelis, P. (1974). "Listening Comprehension and Error Analysis." In G. Nickel, ed., 
AILA Proceedings, Copenhagen 1972, Volume I: Applied Contrastive Linguistics. 
Heidelberg: Julius Groos Verlag, 1-11. 

Breitenstein, R. H. (1972). "Reader's Letters." English Language Teaching 26:2, 202-3. 

Chomsky, N. (1972). Language and Mind. 2nd ed. New York: Harcourt, Brace, 
Jovanovich. 

Chomsky, N. and M. Halle (1968). The Sound Pattern of English. New York: Harper 
and Row. 

Cooper, F. (1972). "How is Language Conveyed by Speech?" In Kavanagh and Mattingly, 
eds., 25-46. 

Johansson, S. (1974). "Controlled Distortion as a Language Testing Tool." In 
J. Qvistgaard, H. Schwarz, and H. Spang-Hanssen, eds., AILA Proceedings, Copenhagen 
1972, Volume III: Applied Linguistics, Problems and Solutions. Heidelberg: Julius 
Groos Verlag, 397-411. 

Kavanagh, J. F. and I. G. Mattingly, eds. (1972). Language by Ear and by Eye: The 
Relationships Between Speech and Reading. Cambridge, Mass.: M.I.T. Press. 

Liberman, A. M., F. S. Cooper, D. P. Shankweiler and M. Studdert-Kennedy (1967). 
"The Perception of the Speech Code." Psychological Review 74, 431-61. 

Makkai, A., V. B. Makkai and L. Heilman, eds. (1973). Linguistics at the Crossroads: 
Proceedings of the 11th International Congress of Linguists, Bologna, Italy. The 
Hague: Mouton. 

Neisser, U. (1967). Cognitive Psychology. New York: Appleton-Century-Crofts. 

Oller, J. W., Jr. (1970). "Transformational Theory and Pragmatics." Modern Language 
Journal 54:7, 504-7. 

Oller, J. W., Jr. (1971). "Dictation as a Device for Testing Foreign Language 
Proficiency." English Language Teaching 25:3, 254-9. 

Oller, J. W., Jr. (1972a). "Assessing Competence in ESL: Reading." Paper presented at 
the Annual Convention of Teachers of English to Speakers of Other Languages, 
Washington, D.C. Published in TESOL Quarterly 6:4, 313-24. 

Oller, J. W., Jr. (1972b). "Dictation as a Test of ESL Proficiency." In Allen and 
Campbell, eds., 346-54. 

Oller, J. W., Jr. (1972c). "Scoring Methods and Difficulty Levels for Cloze Tests of 
Proficiency in English as a Second Language." Modern Language Journal 56:3, 151-8. 

Oller, J. W., Jr. (1973a). "On the Relation Between Syntax, Semantics, and 
Pragmatics." In Makkai, Makkai, and Heilman, eds. 

Oller, J. W., Jr. (1973b). "Pragmatics and Language Testing." Paper presented at the 
First Joint Meeting of AILA/TESOL, San Juan, Puerto Rico. Revised and expanded 
version in Spolsky (1973). 

Oller, J. W., Jr. and J. C. Richards, eds. (1973). Focus on the Learner: Pragmatic 
Perspectives for the Language Teacher. Rowley, Mass.: Newbury House. 

Rand, E. J. (1972). "Integrative and Discrete Point Tests at UCLA." UCLA TESL 
Workpapers (June), 67-78. 

Spolsky, B., ed. Current Trends in Language Testing. Forthcoming. 

Stevens, K. N. and A. S. House (1972). "Speech Perception." In Wathen-Dunn, ed., 
Models for the Perception of Speech and Visual Form. Cambridge, Mass.: M.I.T. 
Press. 

Valette, R. M. (1964). "The Use of the Dictée in the French Language Classroom." 
Modern Language Journal 48:7, 431-4. 

Valette, R. M. (1967). Modern Language Testing: A Handbook. New York: Harcourt, 
Brace, and World. 







APPENDIX 

Table I 

Re-evaluation of Intercorrelations Between 
Part Scores and Total Score on the UCLA ESLPE 1 
with Adjusted (Equal) Weightings of Part Scores (n = 102) 

                   Vocabulary  Grammar   Composition  Phonology  Dictation 
                   (25 pts)    (25 pts)  (25 pts)     (25 pts)   (25 pts) 

Total (125 pts)       .79        .76        .85          .69        .94 
Vocabulary                       .57        .52          .42        .72 
Grammar                                     .50          .50        .65 
Composition                                              .50        .72 
Phonology                                                           .57 


Table II 

Original Intercorrelations Between Part Scores and Total Score 
on UCLA ESLPE 1 from Oller (1971): Weightings Indicated 
(n = 102) 

                   Vocabulary  Grammar   Composition  Phonology  Dictation 
                   (20 pts)    (25 pts)  (25 pts)     (15 pts)   (15 pts) 

Total (100 pts)       .77        .78        .88          .69        .86 
Vocabulary                       .58        .51          .45        .67 
Grammar                                     .55          .50        .64 
Composition                                              .53        .69 
Phonology                                                           .57 






Table III 

Intercorrelations of Part Scores and Total on UCLA ESLPE 1: 
With Self-correlations Removed and with Equal Weightings of 
Part Scores (n = 102) 

                            1           2         3            4          5 
                        Vocabulary  Grammar   Composition  Phonology  Dictation 
                        (25 pts)    (25 pts)  (25 pts)     (25 pts)   (25 pts) 

Total I                    .69 
(2+3+4+5 = 100 pts) 

Total II                              .69 
(1+3+4+5 = 100 pts) 

Total III                                        .72 
(1+2+4+5 = 100 pts) 

Total IV                                                      .59 
(1+2+3+5 = 100 pts) 

Total V                                                                  .85 
(1+2+3+4 = 100 pts) 



DISCUSSION 

Davies: May I make two points? The first relates to the last point that John 
Oller made about high and low correlations. It seems to me that the classical 
view of this would be that in a test battery you are looking for low 
correlations between tests or subtests, but high correlations between each subtest 
and some kind of criterion. Clearly, if as he suggests two tests are correlating 
highly with one another, this would mean that they would both be valid in 
terms of the criterion, assuming that you have a criterion. It would also mean 
presumably that you would only need to use one of them. Now the other 
point, this business of the grammar of expectancy. I find John Oller's 
comments very persuasive. Clearly, what we have is a test that is spreading 
people very widely. He didn't tell us what the standard deviation was, but I 
would suspect that it would be quite high, and it is essentially for this reason, 
I think, that he's getting the high correlations with the other tests when he 
groups them together. The dictation test is providing a rank order, which is 
what one demands from a test, and it is spreading people out. Now, this is a 
persuasive argument in favor of a test. Of course it isn't the ultimate one, 
because the ultimate one is whether the test is valid. However, he provides 
evidence for this validity in terms of the additive thing he's done with the 
other subtests. But I don't understand why this has to be linked onto a 
grammar of expectancy. It seems to me that if there is a grammar of expectancy, it 
should be justified in its own terms, in grammatical terms. And I don't know 
where this justification is. It seems to me that what we have is a satisfactory 
test which is, if you like, a kind of work sample test. I don't understand the 
connection that is being made, if I understand it rightly, on theoretical 
grounds, and I don't see the need for this. 

Oller: Concerning the point on correlation, I tried to start off with what I 
think is a substantial departure from common testing theory of the 50s and 60s 
that says that low correlations indicate that test parts are actually measuring 
different skills. I don't think there's any psycholinguistic basis for that kind of 
inference. That is, unless there is some obvious reason why the two skills in 
question might not be related, like spelling for example. We know that native 
speakers in many cases can't spell, so that the degree of facility with the 
language is obviously not related to spelling. On the other hand, the inference 
that grammatical proficiency or grammatical skills in terms of manipulation 
of structures is not related, say to vocabulary, seems a little less defensible. 
There are a great many studies now that show that even fairly traditional 
grammatical tests, provided they're beefed up, are long enough, and contain 
enough items and alternatives, intercorrelate at about the 80-85% level. I think 
I have about nine tables on as many different studies in an article that 
appeared in the TESOL Quarterly of 1972 illustrating that. Now, if those tests 
intercorrelate at that level, you have to search for some explanation for that. 
It turns out that I expected this, not on the basis of testing theory, but rather 
on the basis of what I think is an understanding of the way language 
functions from a linguistic point of view. If you've read my stuff, you know that I 
haven't bought the Chomskyan paradigm, but rather argue for a grammar that 
is pragmatically based, that relates sentences to extra-linguistic context, and 
it seems to me that a crucial element in a realistic grammar underlying 
language use has to involve the element of time. That's an element that I think 
has been rather mistakenly left out of transformational theory until quite 
recently, and now we're beginning to talk about presuppositions and the 
notion of pragmatics. So much for the theoretical justification of the notion of 
grammar of expectancy. If you're interested in going into it further, I would 
suggest articles by Robert Woods of Harvard University, who is developing 
some computer simulation models for grammars that meet the criteria of what 
I call the grammar of expectancy. On the other comment, you asked about the 
spread on a dictation test. Typically, on a 50-point dictation I think that 
the usual standard deviation is about 13 points. Compare that against a 
standard deviation of probably 8 or 9 points on grammar tests of the sort I have 
described. So it's about twice as much as on other tests, and you're getting 
just that much more information, apparently, out of the dictation. 

Petersen: Are you assuming here that test variance increases proportionate 
to length, and is that the way you weighted these? 

Oller: No. There is a tendency for test variance to increase somewhat 
according to length, but probably not in a linear way. However, I don't mean to 
suggest that it is proportionate. What I am suggesting is that typically the 
variance, that is the amount of spread, the tendency of a test to spread people 
out on a scale, is higher for dictation than it is for more traditional tests. What 
that means in terms of reliability is that you can have a shorter dictation and 
get the same reliability as you would get with the much longer discrete point 
grammar test. 

Petersen: There's one thing I'm wondering about here in terms of your 
part-whole correlations with the total. Why didn't you standardize within your 
subtests first instead of multiplying by a constant? It would seem to me to 
be a much better procedure just to convert your subtotals to standard scores 
before running the correlation. 
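
The difference between the two procedures can be made concrete with a small sketch. Multiplying a subtest by a constant equalizes the maximum number of points but leaves the subtests with different standard deviations, so a widely spread subtest still weighs more heavily in the total. Converting to standard scores (z-scores) gives every subtest a mean of 0 and a standard deviation of 1 before summing. The function names and sample scores below are hypothetical, and the population standard deviation is used as a simplifying assumption.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Convert raw scores to standard scores (mean 0, standard deviation 1)."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("cannot standardize scores that do not vary")
    return [(v - m) / s for v in values]


def standardized_total(scores):
    """Sum the standard scores of all subtests for each examinee."""
    z = {part: z_scores(values) for part, values in scores.items()}
    n = len(next(iter(z.values())))
    return [sum(z[part][i] for part in z) for i in range(n)]


# Invented subtest scores for four examinees (illustration only).
example = {
    "Grammar":   [20, 22, 21, 23],   # narrow spread
    "Dictation": [5, 25, 12, 20],    # wide spread
}
print(standardized_total(example))
```

Each subtest would then be correlated with this standardized total (or with the standardized sum of the remaining subtests) rather than with a sum of rescaled raw scores.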

Oller: I suppose statistically that would have been a more sensible way of 
doing it. I wanted the comparison with the 1971 study to be as straightforward 
as possible, and frankly when I ran those statistics I really wasn't aware of 
that statistical error. But even when it's corrected, it supports the notion. I 
guess my defense there would be that I'm not relying primarily on the 
statistics but rather on the psycholinguistic argument which seems to explain the 
statistics. The statistics, after all, are quite reliable. They've been repeated 
many times now in a great many different studies. Dick Tucker got similar 
results in comparing a cloze test, for example, with the American University 
of Beirut's test of English language proficiency. They have 96% reliability 
on practically every form that they've generated. 

Frey: The time required for scoring these two tests is incredibly different, 
because in one case it's an objective test where time is hardly a factor, and in 
the other case it's a dictation where you have to hand score it, I assume. 

Oller: It's a little harder to score dictation than it is to score an objectively 
constructed vocabulary test. On the other hand, it's a whole lot harder to 
construct a good multiple-choice vocabulary test than it is to construct a good 
dictation. So I think that the two factors tend to balance out and the 
advantages gained in validity on the side of dictation tend to vie for that. I think 
that's part of the motivation behind Gradman and Spolsky's research. In 
investigating possible multiple-choice formats for dictation, it is perhaps 
possible that you can objectivize the technique. This has been done very 
effectively with reading comprehension tests. Just because a test is 
multiple-choice doesn't necessarily mean that it has to be based on naive discrete point 
testing philosophy. A paraphrase matching task, for example, seems to work 
rather well as an estimation of reading comprehension, and it can be done 
in a multiple-choice format. The only trouble with that for classroom 
purposes or for unsophisticated test researchers is that it's awfully easy to make 
a very bad multiple-choice test. And it needs pre-testing; it needs some 
statistics done on it, you need to run item facility and item discrimination indices, 
and to make deletions and changes, rewrites, and you can't do that in a 
classroom situation. So there are serious disadvantages on the side of the 
multiple-choice test as well. 
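
For readers unfamiliar with the two indices mentioned in this exchange, a minimal sketch follows. Item facility is taken here as the proportion of examinees answering an item correctly, and item discrimination as the difference in that proportion between the highest- and lowest-scoring groups. The 27% group size is a common convention assumed for illustration; it is not specified anywhere in the discussion, and the function names and data are hypothetical.

```python
def item_facility(item_responses):
    """Proportion of examinees who answered the item correctly (1) vs. not (0)."""
    return sum(item_responses) / len(item_responses)


def item_discrimination(item_responses, total_scores, fraction=0.27):
    """Upper-group minus lower-group proportion correct on one item.

    Examinees are ranked by total test score; the top and bottom
    `fraction` of them form the two comparison groups.
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))
    ranked = sorted(range(n), key=lambda i: total_scores[i])
    low, high = ranked[:k], ranked[-k:]
    p_high = sum(item_responses[i] for i in high) / k
    p_low = sum(item_responses[i] for i in low) / k
    return p_high - p_low


# Invented responses to one item and total scores for ten examinees.
item = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
totals = [90, 85, 80, 40, 35, 30, 75, 45, 70, 50]
print(item_facility(item), item_discrimination(item, totals))
```

An item that nearly everyone passes or fails, or that weak examinees pass as often as strong ones, would show up here as an extreme facility value or a discrimination value near zero, which is the kind of evidence pre-testing is meant to supply.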

Sako: I wonder if you could explain how dictation measures language 
competence? 

Oller: Dictation invokes the learner's internalized grammar of the language, 
and if that grammar is incomplete, it will be reflected in the score on the 
dictation. If it's more complete, that too will be reflected in the score. You can 
show evidence for that by virtue of the fact that native speakers nearly 
always score 100% on dictations, or at least the ones we've investigated, and 
non-native speakers tend to vary according to their proficiency. So I think 
that it's an indication of an internalized competence on the part of the 
learner. 

Clark: Just a technical question. I think we're concerned about the 
practicality of our testing instruments, and certainly when large volumes of 
students are involved we want to devise a procedure which can be very 
efficiently used. It's always impressed me that a typical dictation has quite a lot 
of dead material in it in the sense that the student is rather easily able to, 
let's say, do half of the sentence, and it's the second half of the sentence 
where the problem comes. If this is the case, would it be possible to think of 
some format where the student is not required to write out the entire passage, 
but only to write a certain portion of the passage, let's say when a light comes 
on at a critical moment? I think this might objectify the administration and 
scoring process quite a bit. 

Oller: I don't think there's any dead material in a dictation. A person can 
make errors at any point whatsoever, and he makes all kinds of creative 
errors, for example, the ocean and its waves instead of the ocean and its 
ways. We had one student at UCLA who converted an entire passage on 
"brain cells" into fairly readable prose on "brand sales." The fact is that 
listening comprehension as exhibited by taking dictation is a highly creative 
process, and it's creative in much the same way that speech production is. 
Stig Johansson did do the kind of thing that you've suggested. He deleted the 
last half of a sentence, and in spite of the fact that both Spolsky and I 
disagree with some of his inferences about the noise test, he did show that the 
deletion of the last half of the sentence works just about as well, and seems 
to have similar properties as a [illegible] as does the straight dictation. And I 
think that's a perfectly viable way of getting data. 

Spolsky: Dictation tests with all their theoretical justification in practice are 
likely to be as limited as FSI tests in their necessary relevance, that is, they 
suit a particular kind of language learner. The FSI test is only a direct 
measure specifically for people who are going to engage in fairly limited kinds of 
conversational language use in particular domains. This was defined nicely 
by the statement that only standard dialects are acceptable, which is a very 
good way of limiting the range in the same way that the dictation test tends to 
be limited to a literate subject, and therefore it's likely to be most useful in 
dealing with college students. That raises another interesting point, namely 
that the research with these tests is done with specific populations. The work 
with the noise test and the dictation test has been done largely with foreign 
students studying in the United States. The basic research work of the FSI 
interview has been done specifically with government employees. There's 
always the danger that we might do the same sort of thing as psychologists 
have done for so long when they assume that, since all their experiments 
were performed with rats or college freshmen, they are able to make 
generalizations from that to other kinds of animals and to other kinds of human 
beings. 

Oller: I don't think that's necessarily true. Stig Johansson did some work with 
a modified cloze dictation type of procedure in Lund, Sweden with university 
level students. His research was more or less replicated with a dictation 
design, and similar results were achieved with a population of elementary 
children in Sweden. Spolsky's and Gradman's research in many ways 
produces similar results to those found at UCLA with a population of foreign 
students from a tremendous variety of backgrounds, and was comparable to 
the results found by Tucker in the Middle East. They were also very similar 
to results of other studies I have seen. That is, these tests seem to have 
certain remarkably stable properties; they tend to be robust, resistant to level of 
language differences. They seem to produce a very high level of variance 
and to spread people out rather widely on a scale, and there seems to be a 
comparability between tests of widely divergent sorts, that is, cloze tests are 
quite different from dictations, and yet the results are similar. So I think that 
all of these factors taken together seem to suggest that there's something 
rather fundamental that is similar about language processing in these various 
superficially different modes, and these tests seem to be capable of revealing 
that. I think that you can produce a sociolinguistically significant difference 
in performance in tests. Quaker showed that in New Mexico with elementary 
school children. She presented an oral cloze test and found she could 
discriminate between four major ethnic groups: Spanish-speaking, native Americans, 
Blacks and Anglos. She found significant differences between each of these 
groups, but I think if she looked at the characteristics of the test, she would 
find that the test was performing similarly across the groups, in spite of the 
slight but significant sociocultural variances. So I would suggest that the 
sociocultural variable is probably a significant but small factor, and the bulk 
of the variance is attributed to certain rather robust properties of the tests. 

Davies: I'm interested in how the dictation text was selected. How do you 
sample for your dictation? 

Oller: In this particular study the dictation was selected on the basis of Lois 
McIntosh's intuition. She assumed that there was a difference between 
humanities and sciences that might be revealed in dictation. Therefore to 
counterbalance for that she selected a passage from the sciences and a passage 
from the humanities. I did a little research on that later with some 359 
incoming students. I selected a passage from an elementary level that was 
horrendously simple. Then another passage was selected from the Jean 
Praninskas grammar review. It was a slightly higher level of language; there were 
more complicated sentences and so forth. Then another passage was selected 
from a reader of college level literary pieces, a much more complicated 
passage. Those three dictations were all given to random samples of the same 
population. The performance on the three dictations in terms of correlation 
with other external validating criteria was almost the same in spite of the 
widely divergent levels of difficulty. So again, the evidence suggests that it 
doesn't make a whole lot of difference whether you take a fairly hard 
passage, a fairly easy one or one somewhere in the middle. The test seems to 
perform similarly, and the correlations you get with external validating 
criteria are similar. The Praninskas passage and the other passage correlated 
almost identically with each of the other external criteria. 

Davies: But doesn't this depend on the level of your student? I mean, if you 
take too easy a passage with an advanced group, your mean score goes right 
up. 

Oller: You're back to the dead data in dictation. If a fairly advanced student 
wouldn't make any errors, granted he'll be off the scale. But the fact is that 
fairly advanced students make errors in fairly simple passages, otherwise you 
wouldn't get that kind of correlation. 

Scott: I'd like to know whether the students whom you tested were taught 
dictation as a teaching technique and whether or not they were therefore 
accustomed to taking dictation. It's been my experience that a student who's 
been taught dictation can do much better on a dictation test. 

Oller: Rebecca Valette seemed to think that it made a difference in her study. 
All I know is that in the several studies that were done at UCLA by Kern in 
33 classes of beginning, intermediate and advanced classes of English, there 
seemed to be no practice effect. That is, you give people dictations 15 or 20 
times during the course of a quarter and they don't seem to do much better 
at the end of the quarter than they did at the beginning. The test seems to 
resist the practice effect. Perhaps one way of attacking that would be to 
replace the dictation with passages that were demonstrably similar in 
difficulty level, in terms of mean scores of similar populations. And then do a 
pre- and post-test evaluation and see if people improved. There might be a 
slightly significant improvement, but it's again going to be a very small part 
of the total variance unless current research at UCLA was all wrong. 

Scott: Is the underlying premise for the use of dictation tests that natives tend 
to perform perfectly and near natives or non-natives tend to perform at some 
point down the scale from that? If that's taken as a basic premise, then I think 
there may be some problems with trying to apply that to learners at the 
opposite end of the scale, namely beginning college learners of a second 
language. 







Oller: This is part of the argument in favor of the validity of the test. One
of the surprising things that you find in discrete-point tests of various
sorts is that if they're sufficiently cleverly constructed, as some
discrete-point tests are, non-native speakers will do slightly better than
natives. That is probably because of a tendency to emphasize certain kinds of
things that people concentrate on in classroom situations. What we're doing is
teaching a sort of artificial classroomese instead of teaching the language.
So I think that it's very important to test the examination against native
speaker performance. Surprisingly, tests that have been in existence for some
time now have not consistently used that technique to test their own validity.
For example, that is not a standard procedure in the development of the TOEFL
exam. I think it should be. I think it ought to be for exams at all of our
institutions where we're trying to measure language proficiency. If the native
speaker can't do it, then it's not a language test, it's something else.

Cartier: John, you talked about how the selections were chosen, and I'm
inclined to believe that experienced people can, in fact, rank prose by
difficulty without the use of the Flesch scale, although Flesch or Lorge or
Dale or Chall might be useful in this respect. But I don't think we ought to
slide over the point quite so lightly as you seem to. I doubt very much
whether you're presenting the whole range, because I can think of a paragraph
from Korzybski's Science and Sanity, for example, and that's part of the
language too. And to use your intuition only is to introduce the bias that
you're going to give the students something that you think they can cope with.
I don't think you're really justified in saying that the difficulty level
doesn't seem to be all that important.

Oller: I think you're probably right. If you really stretched it to its limits
and presented something like e. e. cummings' "up so many floating bells down
anyone lived in a pretty how town," or something like that, then you're out
of the norms of language usage. If you present John Dewey's prose, I think
people would have a little more trouble with that than they would with
Walter Cronkite. The point is that people are able to make fairly good
subjective judgments. Language teachers can judge pretty well what level is
appropriate to the students that they're teaching. If you're trying to find
out if these people can succeed in a college-level course of study, then the
obvious material is college-level text and lecture material, the kind of thing
they're going to have to deal with in the classroom. Here you come back to
Spolsky's point. If you've got a different kind of task, if they're going to
have to drive tanks in battle, then the sociolinguistic variables would
dictate a different kind of level, a different kind of a task perhaps. I think
that people can make pretty good subjective judgments, though, about levels.
Much better than we've thought. Much better than the Dale and Chall and the
other formulas that are available. Our own subjective judgments are usually
superior to those kinds of evaluations.





Contextual Testing 

John Bondaruk, James Child and E. Tetrault 



INTRODUCTION

Like many other governmental and academic institutions, the Department of
Defense is in chronic need of improved testing instruments to measure language
aptitude, achievement in language courses (particularly those given at the
Defense Language Institute), and language proficiency. Because these tests
must be used with fairly large populations, they must be reasonably easy to
administer and score.

While our present work extends into such areas as aptitude testing and
improved test design for measuring aural comprehension, the tests described in
this paper are aimed at foreign language reception skills in the written
medium, including translation tests where translation itself is the desired
terminal skill.

The test forms which are presented below are referred to as "contextual
tests," because test items are presented in natural discourse-length contexts
rather than single-phrase or single-sentence frames. While the use of such
testing methods appears to be rare in the United States, we can make no claim
for uniqueness. Cloze tests have been around for a number of years, and they
are in some ways similar to the tests we are developing. Language teachers
must surely have worked with drill and test formats which resemble contextual
tests. We have, however, developed item selection and control methods and
approaches to item analysis which may very well be new. In any event the final
proof of the effectiveness of testing instruments and procedures is not their
novelty but their predictive validity; that is, the degree to which they allow
us to make the proper hiring and job placement decisions which are vital to
the effective use of language talent within the Department of Defense.

Before presenting a detailed description of test forms and proce- 
dures, we shall devote some attention to the theoretical considera- 
tions which apply to this form of language testing. 

LANGUAGE TESTING AND LANGUAGE THEORY 

There are two general approaches to language testing used within the Defense
Department. The most commonly encountered procedure is to test so-called
discrete points. Following this procedure, a test writer first draws up a list
of "language facts" which the examinee is expected to have acquired (forms,
words, rules of usage, idioms, and the like) and then writes items against
this list. With this procedure a significant quantity of language components
can yield a fair representation of the language as a whole. There are usually
50 to 100 such problems on the test, no one of which is linked linguistically
to any other in the format, but all of which taken together presumably offer a
valid sampling of the workings of the language in question.

The second approach to measuring language proficiency calls for the examinee
to show an overall control of grammar and lexis in context and to do so by
performing specific language tasks: by translating or summarizing a passage,
by answering questions on its content, by taking dictation, etc. These tasks
are generally performed and then judged according to previously established
criteria (time, quantity, quality) and are referred to at DOD or DLI as
"criterion-referenced tests."

Both kinds of tests have honorable histories and when well designed have much
to offer. The first allows easy testing of surface morphology, including both
affixes and bases. It also offers limited, but efficient, ways of exploring
the examinee's competence in dealing with, for example, tense or aspect
systems or the rules of embedding within a particular sentence. Thus, the test
taker may be asked to supply or identify, in frames of phrase or sentence
length, a plural ending for a noun or a past tense form for a verb, a
subjunctive form, or a verbal nominalization in a subordinate clause. However,
it is difficult if not impossible to test affixes and words which link clauses
and sentences, or the use of pronouns and other PRO-forms which refer backward
or forward within a discourse.

The second type of test, on the other hand, calls for the examinee to give
evidence of his overall control of a foreign language passage. Its strength
lies in the fact that it does reflect a natural use of language. However, a
great many reading comprehension tests or translation tests either concentrate
on subject matter knowledge which the examinee may or may not possess, or are
constructed so loosely that precise determinations of language problems may be
impossible. This is especially the case with those individuals whose
performance falls into the middle or lower ranges.

Language testing reflects language teaching, which in turn reflects a
particular model of competence, i.e. a model of internalized grammar. It is
probably true that most language courses are derived either explicitly or
implicitly from a "systems" view of language. Ferdinand de Saussure and his
successors expanded at length upon the distinction which is at the heart of
the two main testing approaches described above. Louis Hjelmslev made explicit
the point that there is no process in language without an underlying system.
To the extent that such a statement is of practical value, any derived system
ought to be applicable to most "processes" or texts. Thus, teaching,
reference, and testing materials should be organized to convey the system in
context-sensitive terms.

This is rarely the case in practice. Courses provide rules which allow the
learner to produce sentences like "John kicks the ball" or to acquire some
general idea of what such sentences might mean if someone else uses them.
Older grammars of all widely studied languages have had these
subject+verb+object sentences in abundance. Newer ones often have them too,
but with a new set of grammar terminology. In the case of "John kicks the
ball," all we have is a grammatical formula. An adequate grammar would specify
at least a time adverbial to make the sentence plausible and might also
provide for place and manner adverbials as well: "John kicks the ball in the
school-yard every day."

The task-criterion approach to testing may very well be an effort to move away
from testing grammar systems as exemplified in formulae like "John kicks the
ball." However, it is not an attempt to deal directly with process as such,
but rather with what is produced, the results of process. If someone is able
to perform a given language task within a fixed period of time with an
appropriate percentage of the task done correctly, then we are justified in
inferring that he controls some of the processes of the language. But, as we
noted above, such a procedure is less effective with examinees in the middle
and lower ranges, those individuals who do not yet control the language to a
significant degree.

We are trying to develop a third approach, one which addresses a different
construct of proficiency, something akin to what John Oller calls a "grammar
of expectancy."1 In very general terms this involves the ability to anticipate
certain elements of language discourse. There is a growing body of opinion to
the effect that this ability is closely related to both reception and
production skills, in short to global proficiency.2 The particular test forms
which we have developed and some of their characteristics will be the subject
of the next part of this paper.

CONTEXTUAL TEST FORMS

Contextual testing represents an attempt to measure someone's ability to apply
his knowledge of grammar-lexis systems to a specific point in a discourse. The
test forms usually require that deleted material be restored. However, the
forms shown below are not true cloze tests, in that the deletions in them are
not systematically random, but planned. Authentic texts are used (as opposed
to sentences generated as examples of grammar or usage), because such material
is much more likely to provide the frequencies, patterning, and constituent
order typical of a given language style-level and register.
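
The contrast between a true cloze test and a planned deletion can be made
concrete with a short sketch. This is only an illustration under stated
assumptions: the function names, the sample sentence, and the set of target
words are hypothetical and are not taken from the tests described here.

```python
def nth_word_cloze(words, n):
    """Classic cloze: blank every n-th word, whatever that word happens to be."""
    blanked, answers = [], []
    for i, word in enumerate(words):
        if i % n == n - 1:
            blanked.append("____")
            answers.append(word)
        else:
            blanked.append(word)
    return blanked, answers


def planned_cloze(words, targets):
    """Planned deletion: blank only the words the test writer has chosen."""
    blanked, answers = [], []
    for word in words:
        if word.lower() in targets:
            blanked.append("____")
            answers.append(word)
        else:
            blanked.append(word)
    return blanked, answers


# Hypothetical sample sentence and target set, for illustration only.
text = "the test forms usually require that deleted material be restored".split()
every_third, key_nth = nth_word_cloze(text, 3)
planned, key_planned = planned_cloze(text, {"the", "that", "be"})

print(" ".join(every_third))
print(" ".join(planned))
```

In the first function the deletion pattern is fixed in advance and blind to
word class; in the second the writer decides which points in the discourse
are to be probed, which is the sense of "planned" used above.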

The first group of forms shown (labelled Form A, Form B, and Form A/B) are
directed at level 2 reading comprehension (Foreign Service Institute R-2).

In Form A high-redundancy points in the string (affixes, function words, etc.)
are suppressed and must be supplied in order to reconstruct an integral text.
Every effort is made to avoid multiple solutions, but trivially different
responses (e.g. synonyms) may be inevitable at some junctures. For example,
note the last blank in this sample Form A in English:

Form A (English)

Oil-import- nation- end- their U.S. meeting.
agree- ____ to meet producer- energy conference
adopted most Washington- general proposal-
____ cooperation assure adequate fuel suppl-
and to try to get oil price- reduc- France.
forced the meeting third day. signed final communique.
objected many provisions.

Either "the" or "its" can fill the determiner slot in the last blank of the
sample.

In a French test sample only function words are suppressed:

Form A (French)

L'ami ____'Italie parcourait dès avant guerre ses
routes et ____ cités, éprouve depuis WM) une sensation nouvelle.
____ pays reste aussi attirant ____'autrefois, la misère est moins
grande. Milan avec ses commerces et ses tissages, Turin ____ ses
usines automobiles et, d'____ manière plus générale, les villes
____ Nord sont de plus ____ plus prospères. Si l'on descend vers
____ Sud, la pauvreté commence ____ reculer. On peut plus
écrire, comme ____ lendemain ____ guerre, "Le Christ
s'est arrêté à Eboli."

The first blank in the second sentence, "LE pays . . .," is the type of
response which this format handles easily, although it would be difficult to
elicit in a multiple-choice test (with few viable distractors and a
sub-sentence context). Other responses cover a range of simple parallelisms,
e.g. "Milan avec ses commerces et ses tissages, Turin AVEC ses usines
automobiles . . ."; common phrase and clause patterns, e.g. ". . . les villes
DU Nord . . ." and ". . . aussi attirant QU'autrefois . . ."; and bound forms,
e.g. ". . . de plus EN plus . . ." and ". . . la pauvreté commence À reculer."

The next sample demonstrates a form of cueing which is external to the text
itself. In this case it is simply an enumeration of words deleted from the
text without any indication of the number of times a word appeared in the
original. External cueing is one obvious way to control the difficulty level
of this kind of test, and it can take any number of forms, not excluding an
English translation or summary of the passage to be restored.

Form A (Spanish)

____ industria del petróleo ____ gas Unión
Soviética desarrolla ritmo acelerado. Obreros, ingenieros,
técnicos decidieron extraer 8 millones toneladas
petróleo ____ encima lo establecido ____ plan
quinquenal de fomento economía de la URSS.
____ años 1966 ____ 1970; y ese compromiso, lo visto,
será cumplido.

These items have been deleted at least once from the text:
a, al, de, del, el, en, la, los, para, por, se, y
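
A cue list of this kind (each deleted word listed once, with no count) can be
sketched as follows. The function name and the sample sentence are assumptions
made for illustration; the sentence is a guessed full version of the opening
of the Spanish passage, not the authors' original text, and the sketch blanks
every function word rather than a planned selection.

```python
import string

# Function words taken from the cue list printed with the Spanish sample.
FUNCTION_WORDS = {"a", "al", "de", "del", "el", "en", "la", "los",
                  "para", "por", "se", "y"}


def make_form_a(text, function_words):
    """Blank every listed function word; return the form and a cue list.

    The cue list names each deleted word once, in alphabetical order,
    and gives no indication of how often it was deleted.
    """
    out, deleted = [], set()
    for token in text.split():
        core = token.strip(string.punctuation)
        if core.lower() in function_words:
            out.append(token.replace(core, "____", 1))
            deleted.add(core.lower())
        else:
            out.append(token)
    return " ".join(out), sorted(deleted)


# Hypothetical reconstruction of the first sentence, for illustration only.
sample = ("La industria del petróleo y del gas de la Unión Soviética "
          "se desarrolla a ritmo acelerado.")
form, cue_list = make_form_a(sample, FUNCTION_WORDS)
print(form)
print(", ".join(cue_list))
```

Listing each item only once is what keeps the cue weak: the examinee knows
which words are candidates but not how many times each must be used.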

In the Chinese Form A,3 certain function words (single characters) were
deleted and replaced with numbered blanks. A list of the items suppressed (and
a few extra ones) is given to the right of the text, and the examinee is asked
to match the number of the blank with the appropriate character in the list.

The final sample Form A is in Russian. As it is shown below, it presents a
number of allomorphic problems at the base + affix juncture. This format was
not entirely satisfactory, particularly in the way function words were
elicited:

Form A (Russian)
В Москве недавно представительство

открыться

ФРГ- "Дойче банк АГ".

один- PREP крупнейший- банк

На состоявшейся II

выступать PREP советский

журналистов, Ф. Х.

иностранный пресс-конференция

Ульрих, председатель правления этого банка, заявил, что создание

преследует цель

представительство дальнейший-развитие

уже

существовать - экономический - контакт (pl.) PREP
нашими

два - страна

In the second format used experimentally (designated Form B), the items
suppressed are content words (nouns, adjectives, verbs, non-redundant
prepositions, etc.). At the level which we were trying to measure (R-2), we
found that considerable prompting was necessary in the form of lexical choices
given below the text, morphophonemic clues, lexemic clues, and again perhaps
even translations or summaries of the original. The following English example
illustrates the general format:

Form B (English)

____-t In ____-s and some ____-s of the is
-ed. -s -d. The was -ed

last after a at the in

nouns: Washington, official, House, practice, senator, member,
mail, explosion, embassy, summer
adjectives: congressional, postal, letter-bomb, British
verbs: start, send, X-ray, say

Form B is in a sense a mirror image of Form A, in that free rather than bound
morphemes are removed from the text and placed in alphabetic or random order
beneath the text. Extra words may be added to this list as long as they are
not equally plausible in context. The examinee is asked to supply members of
open systems, as opposed to the largely redundant members of closed systems
elicited in Form A, and, therefore, he is required to reconstruct the original
message of the text.

A sample Form B in French is shown below:

Les ==== que pose notre défense sont d'une ____ ampleur. La
____ dont nous les résolvons ==== d'une ____ exceptionnelle.
Nous ==== en un temps où les menaces ==== revêtir les formes les
plus ====, les plus terrifiantes. De là est apparue l'____ de créer
à une ____ récente, ____ la présidence du général Buis, "la
fondation ____ les études de défense nationale," avant tout un
____ de réflexion, de suggestion, qui se ==== d'inciter les Français en
____, l'____ intellectuelle en ____, à ____ par leur ____ de
conscience, par ==== travaux, au succès de la tâche qu'il s'est
fixée.

bloquer           inattendu         pouvoir (v.)
date              leur              prise
démocratie        manière           problème
élite             opportunité       proposer
être              organisme         société
extraordinaire    participer        sous
général           particulier       traverser
importance        pour              vivre
In this test, as it is shown, affixes are not given, but double lines in a
blank indicate that some operation must be performed on the item selected.
There are a few extra items given in the lists. The degree to which item
selection is cued can be illustrated with such examples as "opportunité" in
the string ". . . De là est apparue l'____ de créer . . ." where the clues are
noun, feminine noun, and feminine noun beginning in a vowel, giving choices of
"élite," "importance," and, of course, "opportunité." Other forms of cueing
are "problème" in co-occurrence with "poser" with the plural form obvious
after "les" and such common patterns as ". . . en GÉNÉRAL . . . en
PARTICULIER . . ."
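
The way successive cues narrow the field for the "opportunité" blank can be
sketched as a series of filters over the word list. The part-of-speech and
gender labels below are annotations added for this illustration (the printed
list carries none), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    word: str
    pos: str      # "noun", "verb", "adj", "prep", or "det" (assumed labels)
    gender: str   # "f" or "m" for nouns, "" otherwise (assumed labels)


# The French Form B word list, annotated by hand for this sketch.
WORD_LIST = [
    Entry("bloquer", "verb", ""), Entry("date", "noun", "f"),
    Entry("démocratie", "noun", "f"), Entry("élite", "noun", "f"),
    Entry("être", "verb", ""), Entry("extraordinaire", "adj", ""),
    Entry("général", "adj", ""), Entry("importance", "noun", "f"),
    Entry("inattendu", "adj", ""), Entry("leur", "det", ""),
    Entry("manière", "noun", "f"), Entry("opportunité", "noun", "f"),
    Entry("organisme", "noun", "m"), Entry("participer", "verb", ""),
    Entry("particulier", "adj", ""), Entry("pour", "prep", ""),
    Entry("pouvoir", "verb", ""), Entry("prise", "noun", "f"),
    Entry("problème", "noun", "m"), Entry("proposer", "verb", ""),
    Entry("société", "noun", "f"), Entry("sous", "prep", ""),
    Entry("traverser", "verb", ""), Entry("vivre", "verb", ""),
]

VOWELS = "aeiouyàâéèêîôû"

# The three cues named in the text, applied in order.
CUES = [
    ("noun", lambda e: e.pos == "noun"),
    ("feminine noun", lambda e: e.gender == "f"),
    ("feminine noun beginning in a vowel", lambda e: e.word[0] in VOWELS),
]


def narrow(entries, cues):
    """Apply each cue in turn and record the surviving candidates."""
    stages, remaining = [], list(entries)
    for label, keep in cues:
        remaining = [e for e in remaining if keep(e)]
        stages.append((label, [e.word for e in remaining]))
    return stages


for label, words in narrow(WORD_LIST, CUES):
    print(f"{label}: {', '.join(words)}")
```

Under these annotations the last stage leaves the three choices named in the
text, after which only meaning can decide among them.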

The final sample Form B is in Russian.

Form B (Russian)
Стало уже ____-ым, что одно ____-е жатвы
всех, словно солдат по тревоге, ____-ет на ноги. Это в
какой-то ____-и объяснимо. Хлеб-наше ____-о,
его нельзя упустить. И, естественно, ____-ется особенно
высокая ____-ь от ____-ого ____-а.
Вопрос в том, как она ____-ется.

активность     поднимать      работник
богатство      приближение    совещание
государство    привычный      степень
каждый         проявляться    требоваться

Many of the responses in the Russian Form B are more or less mechanically
controlled (e.g. in the second blank on the first line the cues are: noun,
neuter noun, and neuter noun ending in -е, leaving the examinee with only two
choices). However, later on in the sample there are two blanks marked for
reflexive verbs (with the morpheme string -ется) and there are also two such
verbs on the list. These verbs appear at first to be interchangeable in the
text, but semantic considerations, particularly idea sequencing, weigh heavily
in favor of using требоваться first and проявляться at the end. It appears
more reasonable to say that a high level of activeness is required (требуется)
of every worker; the problem is the form which this activeness takes
(проявляется).

The final type of contextual test which has been used for R-2 evaluations
combines the features of both Form A and Form B and is designated Form A/B.
The example given below uses the same original text as the French Form B
above. The responses elicited include function words and content words, and it
is necessary to perform various operations on some of the items which are
selected (such operations as inflection, combination, and reduction). The only
external cueing used in this sample test is a listing of deleted items. Once
again double lines under a blank indicate that one of the operations alluded
to above must be performed on whatever has been selected. This listing
contains only those items actually appearing in the original, and some of the
words have numbers after them to indicate the number of times the word
appeared in the text.

Form A/B (French)

Les ==== ____ pose notre défense ==== d'une extraordinaire
ampleur. La ____ dont nous les ==== est ==== importance
====. Nous ==== en un temps ____ les menaces ==== revêtir
les formes les ____ ====, les ____ terrifiantes. De ____ est
apparue l'____ de créer, à une ____ récente, sous la ____ ====
général Buis, "la fondation pour les études de défense nationale,"
avant ____ un organisme de ____, ____ suggestion, qui se ====
d'inciter les Français en ____, l'élite intellectuelle en ____,
____ participer par leur prise de conscience, ____ ==== travaux, ====
succès de la tâche qu'il s'est fixée.

à (2)           leur              problème
date            manière           proposer
de (3)          opportunité       que
être            où                réflexion
exceptionnel    par               résoudre
général         particulier       tout
inattendu       plus (2)          un
là              pouvoir (verb)    vivre
le              présidence





Thus far the test samples shown are supposed to measure R-2 proficiency.
Whether they do or not has to be established empirically. The same methods can
be used for R-3 testing with different deletions and cueing. There is,
however, one additional test format which has been used for diagnostic
purposes where translator performance is involved. This is a so-called
translator readiness test (TRT).

The TRT answer form is based on a translation in English of a foreign language
text. Extensive deletions have been made in the English, and the examinee is
asked to restore the deleted material according to information available
contextually (undeleted words, number of blanks, punctuation, etc.) and also
according to the basic message contained in the foreign language text. The
test is not aimed specifically at foreign language reading comprehension,
although the examinee is obviously unable to function without this skill, nor
is it intended to serve as a measure of English writing skills. The test's
main goal is to reveal the kind of a mind-set which allows a potential
translator to deal with nearly identical propositions realized in sharply
differing surface forms.

There are two sample tests below. In one, the "source" text is nothing more
than an English paraphrase of the text on the answer sheet. The second is an
excerpt from a TRT with a Russian source text.

TRT (English Source)

By imposing on itself far more rigid standards than its allies have
done, the United States has simply shut itself out of markets without
any impact on Communist countries or gain to national security.

The ____ ____ has on standards are
those its themselves.
so has only make
on or for but ____
also itself .

TRT (Russian Source)

В конце прошлого века Франция собиралась отметить столетие
буржуазной революции 1789 года. По этому случаю было решено
организовать всемирную выставку и придумать что-нибудь необыкновенное,
уникальное.

At the end century France considering
the of the of the 1789 bourgeois revolution.
____ ____ decided to this by a
and up something unusual, even unique.

It should be noted that testing translation skills beyond the "readiness"
stage requires performance unhampered by structured forms. Traditionally, the
evaluation of such performance has relied on intuition, i.e. on the
evaluator's overall impression of the accuracy and appropriateness of the
translation. We have, however, developed a grading scheme based on a
case-grammar approach which can be made to work reasonably well. Graders
working independently on the same translation test usually arrive at raw
scores (expressed on a base of 100 points) which differ from one another by no
more than five points.
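
The five-point agreement figure suggests a simple check that could be run on
any set of independently assigned scores. The sketch below is only an
illustration: the function names and the sample scores are hypothetical, and
the grading scheme itself is not represented.

```python
from itertools import combinations


def max_pairwise_gap(scores):
    """Largest difference between any two graders' raw scores (0-100 scale)."""
    return max(abs(a - b) for a, b in combinations(scores, 2))


def graders_agree(scores, tolerance=5):
    """True when no two graders differ by more than `tolerance` points."""
    return max_pairwise_gap(scores) <= tolerance


# Hypothetical raw scores from three graders on the same translation.
print(graders_agree([82, 85, 80]))
print(graders_agree([82, 90, 80]))
```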



PROPOSED ANALYTIC PROCEDURES 

The effectiveness of a test depends on the characteristics of the elements or
items which make it up. A test score is the resultant of the validities,
reliabilities, and intercorrelations of its component elements. It is at this
point that past efforts to develop cloze tests have apparently run into
trouble.

A preliminary step in the evaluation of our contextual tests will 
be an empirical analysis of their internal characteristics using item 
analysis and cluster analysis procedures. This will be done to check 
the match between our expectations of the operating characteristics of 
our tests and how they actually function when used with a linguist 
population. 

Item analysis procedures assess two statistical characteristics of individual
test items which are of great interest to us. The first is the difficulty
level of each item being used; the second is the degree to which each item
differentiates those who are high from those who are low in language
proficiency.
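
The two statistics can be sketched for a constructed-response item as follows.
This is a minimal sketch under stated assumptions: the difficulty figure is
taken to be the proportion of examinees giving each response, and the
discrimination figure is taken to be the correlation between giving that
response and the total score (a point-biserial index). The paper does not name
the index actually used, and the function names and data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; defined here as 0.0 if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def item_analysis(responses, totals):
    """For each distinct response, return (proportion, discrimination).

    responses: one constructed response per examinee for a single item
    totals:    each examinee's total score on the subtest, in the same order
    """
    n = len(responses)
    table = {}
    for option in sorted(set(responses)):
        chose = [1 if r == option else 0 for r in responses]
        table[option] = (sum(chose) / n, pearson(chose, totals))
    return table


# Hypothetical responses and subtest totals for six examinees.
responses = ["O", "O", "QUE", "O", "SER", "O"]
totals = [31, 12, 27, 14, 29, 30]
for option, (diff, disc) in item_analysis(responses, totals).items():
    print(f"{option:5s} Diff. {diff:.3f}  Disc. {disc:+.3f}")
```

Computing both figures for every response, not only the keyed one, is what
allows an incorrect response with a positive index to stand out.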

We are currently awaiting receipt of data from the administration of our
contextual tests, which were administered along with conventional proficiency
tests to samples of 100 Portuguese, 100 German, and 100 Russian language
majors. An example of item analysis results based on the responses of 25
individuals taking the Portuguese test on two items from one contextual
subtest appears below:

Constructed Responses

ITEM             O        QUE      SER      UM       SENDO    OMIT
4.    Diff.    0.520*    0.200    0.160    0.040    0.040    0.040
      Disc.   -0.023*    0.012    0.141    0.057   -0.345    0.057

                 AR       ASSE     ADOR     DER      ANDO     OMIT
5.    Diff.    0.360*    0.240    0.160    0.080    0.120    0.040
      Disc.    0.489*    0.118   -0.092   -0.284   -0.571    0.057

*indicates the response keyed as correct.

The context in which these responses were given is:

algu- anos, a idéia Brazil export-
aço parecer- ____ absurd-

These data suggest that item 4 is a very poor item. The correct response "O"
shows a negative item discrimination value and incorrect responses "QUE,"
"SER," and "UM" show positive discrimination values. Item 5 shows good
discrimination for the correct response "AR," but the incorrect response
"ASSE" also shows positive discrimination value. When items 4 and 5 are
considered as paired items, with the two elements dependent upon each other,
two interesting relationships emerge. First, the examinees who produced both
"O" for item 4 and "AR" for item 5 turn out to be high scorers on this section
of the test. Low scoring examinees who produced the correct response "O" for
item 4 and thereby caused the negative discrimination value of the "O"
response did not see the relationship between "O" and "AR" and consequently
did not produce the correct response for item 5. The statistical
characteristics of the linked response of items 4 and 5 indicate better item
discrimination than was achieved by scoring these two items separately. The
second interesting relationship appears when the positive discrimination index
for response "ASSE" on item 5 is considered. That some examinees who score
high on the complete subtest wrote "ASSE" rather than the correct response
suggests the existence of a possible link between "ASSE" and a response
produced in item 4. "QUE" showed a positive discrimination value by itself;
when linked with "ASSE" the combined response produces an acceptable pattern.
Two acceptable response links therefore exist: the "O . . . AR" link and the
"QUE . . . ASSE" link, an assertion which was confirmed by native speakers or
by individuals with near-native fluency.
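
The difference between scoring items 4 and 5 separately and scoring them as a
linked pair can be sketched as follows. The two acceptable links are those
named above; the function names and the three sample examinees are
hypothetical, and the one-point credit for a linked pair is an assumption made
for the illustration.

```python
# The two response links reported as acceptable for items 4 and 5.
ACCEPTABLE_LINKS = {("O", "AR"), ("QUE", "ASSE")}


def score_separately(r4, r5, key4="O", key5="AR"):
    """One point per item that matches its own key, ignoring the other item."""
    return int(r4 == key4) + int(r5 == key5)


def score_linked(r4, r5, links=ACCEPTABLE_LINKS):
    """One point for the pair, given only when the two responses fit together."""
    return 1 if (r4, r5) in links else 0


# Hypothetical examinees: (response to item 4, response to item 5).
examinees = [("O", "AR"), ("O", "ANDO"), ("QUE", "ASSE")]
for r4, r5 in examinees:
    print(r4, r5, score_separately(r4, r5), score_linked(r4, r5))
```

Under separate scoring the second examinee earns credit for "O" without
having seen its relation to the verb form, while the third earns nothing for
a pair that native speakers accepted; linked scoring reverses both outcomes.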

The test samples which we expect to have (300 examinees in three languages)
will give us the opportunity to apply item analysis as well as cluster
analysis techniques to a variety of contextual tests. We will also be able to
correlate contextual tests with a conventional discrete point test
(multiple-choice) in one language and with translation exercises in all three
languages. We hope to produce not only better R-2 tests in three languages
(German, Portuguese, and Russian) but also better approaches to writing and
validating language proficiency tests in general.

NOTES

1. Oller, John W., "Pragmatic Language Testing," Language Sciences 28
(December 1973), 7-12.

2. Fry, D. B., "Speech Reception and Perception," in John Lyons (ed.), New
Horizons in Linguistics. Middlesex, U.K., Penguin Books, 1970, 30-32 and
47-50.

3. The sample of the Chinese Form A was deleted because of technical
difficulties.

DISCUSSION

Davies: I'm particularly interested in the method of item analysis that's just
been described, and I'd like to ask two questions about it. First of all, did
you actually quote the discrimination figure for the linked analysis?

Bondaruk: No, I didn't give you that value, but it's a high positive value.

Davies: The other question is, what is a cluster? It seems to me that this
raises the whole question of the sequential nature in a cloze-type test of the
dependencies between one item and another. Items typically are regarded as
discrete, and as you have pointed out, this is not necessarily the case in a
cloze test. Where do you draw the boundary for your cluster?

Bondaruk: Let me make an initial comment. One of the things that has occurred
in the development of these specific tests is that my linguistic colleagues
have attempted to develop a variety of structures, and in selecting the
blanks, unlike cloze techniques, they've not taken away every nth word,
they've set up certain specific pattern structures. In the analytic phase,
what I'm attempting to do is to validate, if you will, their theoretical
position in structuring those by comparing them against actual empirical data.
Now, there may be situations where, using a straight statistical approach, I
will surface some combinations, some linkages that were overlooked in the
development. I also may, once I complete my analysis, present problems to my
linguistic colleagues who felt very strongly that certain structures were
dependent upon other things, but empirically with my subject sample, they
didn't perceive this linkage, they didn't react to this specific linkage.
Consequently, there may be something wrong with the way the item was
structured, or it may be going back to the drawing board in some theoretical
aspect from a linguistic sense.

Oller: I just wonder how you happened to notice this particular linkage.
Presumably you were clued into something strange when you noticed the
negative discrimination values in certain responses and not in others. But I
wonder if the explanation wasn't really produced more by your knowledge of the
language than it was by the statistics, except for the fact that the
statistics suggest that there might be a problem here.

Bondaruk: Let me say first of all, the only Portuguese I know is what I see on 
the signs when I'm watching the Olympic games. And you're quite correct, 
looking at what I've listed as item 4 posed a problem. If you look at item 5, 
however, that positive discrimination index for asse was a problem also. So, 
in my ignorance the only question I would ask is, "Is there any relationship 
between the higher scores on the tests picking asse and this odd, not only 
negative discrimination but positive discrimination, index for three of the 
other items in item 4?" I would have asked, "Is asse linked with que, sen or 
um?" with no knowledge of the language. It turns out that when asking that 
question to a linguist, the linguist replies, "Oh yes, que asse is a legitimate 
construction." Now I accept the linguist's statement here. So I'm just 
suggesting that in the keying of this, it's conceivable, since they are constructed 
responses, that there may be acceptable response patterns that the linguists 
haven't picked up. And so from this analysis you'll pick up the key. The 
second element is that generally in this specific format, we're dealing with only 
a limited number of possible constructions that would be acceptable, and 
hopefully this statistical procedure will identify most, if not all, of those that 
are significant so that we can develop a more multiple key for analyzing this 
test and hopefully the multiple key will include the linkages. 
Nickel: My questions are directed toward Dr. Child. Is the term 
non-contextual an official term in this country, and is there any non-contextual 
type of testing? 

Child: No, there isn't any received tradition here, contextual vs. 
non-contextual. Everything is sooner or later contextual. Generally, something is 
going to be rooted in context, so it becomes a matter of degree. However, I 
think you can make a fairly strong case for an analysis of a passage in which 
the external world or the situation more or less impinges on the language 
string you develop and the items that you suppress. Everything is going to be 
ultimately semantic, hence ultimately contextual. But there does come a 
point where you say, "John kicks the ball," and it has some kind of meaning, 
but not any real meaning much beyond a formula. So in that sense I would 
call that a non-contextual situation. 



Nickel: My second question. In our research on error analysis we are very 
much interested in discovering relations between certain tests and certain 
errors. There are also certain types of learners and testees. My question is, 
"Have you discovered any relationship between certain types of learners and 
testees and certain types of test frames?" I'm sure not all of us are specialists 
in crossword puzzles. I think that maybe certain types of testees are better 
at solving this kind of contextual test analysis. 

Child: It may be that a few of those on top are the crossword puzzle people, 
and a few at the bottom are brilliant linguists who can't handle this kind 
of format. I don't think this is a major problem, however. 

Bondaruk: I think the critical issue here at this moment is that we've been 
working very hard on the theory and development of the instrument. As I've 
suggested, we haven't yet gotten back our field trial data. I talked in my 
portion of the paper about the preliminary steps and analysis, which is to 
examine the internal consistency of the specific testing instrument. There's one 
final step that we must have also, and that is, we will have to examine these 
test scores and the operational characteristics of the test against other criteria. 
The other criteria may be from a construct validity point of view. But 
ultimately for our use of the test, we will have to look for some criteria of 
performance, successfully achieving what our requirements are for the language. 
But at this moment we've developed the instruments, we've sent them out for 
trial, but we haven't analyzed the data yet. 



Hindmarsh: I'd like to ask a fairly simple question. In the late 1950s in Africa, 
we were using tests of this type, where a paragraph or an anecdote had a 
certain number of items deleted and the candidates had to fill in the blanks. 
One of the things we found in constructing the tests was that we argued a 
great deal about what items should be deleted, and we found that we brushed 
against but never adequately solved the criteria according to which blanks 
should be, in fact, deleted; how far this should be lexical and syntactic and 
how far they should be content related, how much you could leave out and 
yet leave a fair stimulus. I'd like to know whether you have any criteria 
either worked out or in embryonic state about the proportion of blanks to 
texts (I notice that there's a big difference between the Chinese text and some 
of the others), the frequency with which these blanks occur against the total 
number of words in a sentence, the lexical and syntactic spread, which I 
imagine is a function of the test? And also related to this, I'd like to know how 
far you regard it possible or likely that tests of this kind could be applied to 
the R-4 and R-5 levels? 

Bondaruk: In answer to your first question, we don't have any empirical 
criteria for counting the number of blanks. I think we started at one end of 
the spectrum and left virtually everything blank and have been working our 
way back, working with relatively known populations, or populations that we 
could get a good ranking for first, before we tried out the test so that the 
population was the constant and the test the variable. As far as using this type 
of test for the higher levels, I think we could probably get quite a bit of 
mileage at the 3 level. I think to blank out any more would get indeed into 
what Professor Nickel is calling a puzzle. 

Ellis: I've got a question that arises from a comment from one of the speakers 
concerning redundancy. The comment was that the passage, when given to a 
native speaker or near native speaker, gave the result that the native speaker 
could understand it. This, I think, implies that the items that have been left 
out from the test were highly redundant. This being so, what is this test, in 
fact, measuring? I think this raises a more fundamental question of what the 
cloze test itself measures. If, as I deduce, you are measuring redundant items, 
then it's going to be of very limited value in language testing and teaching. 

Bondaruk: Obviously we're not terribly interested in examinees' being able 
to produce redundant items, items which are quite learnable and learned very 
early in the process. I would say that, in our view, given the nature of second 
language competence, the ability to supply the redundant items is a clue that 
they're following the true syntactic relationships, that they're reading the 
relationships among the major sentence constituents or phrase constituents. 
In other words, they're using that to demonstrate that they are indeed 
following the basic algebraic message of the sentence. That obviously needs to be 
supported by data, however. But that is our intention. 

Spolsky: I think instead of referring to this as contextual testing, if one were 
to refer to this as cloze testing, then it would very simply tie up with other 
work on cloze testing and one could get the justification for that. But a 
couple of interesting points come out here. One is to ask what exactly this 
is the test of, and I would argue that this isn't a test of only reading 
comprehension. It's a way of testing through reading the overall and underlying 
knowledge of a language. One of the other interesting things is the interaction 
here between linguists and psychologists in tackling a particular problem. It 
is interesting to note that psychologists are taking regular methods of dealing 
with a discrete item test and applying them to what is really an integrative 
test, discovering then that the way in which you normally read item analysis 
is quite different for this kind of reading of item analysis. 
Bondaruk: I suggested that item analysis was one aspect of looking at these 
data. The heart of the matter is the identification of clusters and linkages, 
and I'm hoping that the power of the factor analytic technique, when applied 
to these data, will provide us with the kind of results that will lay out the 
connections and linkages within the structure. 

Clark: I'm wondering if you're stressing this as a test of reading 
comprehension as opposed, let's say, to the ability to be a translator. Do you want to 
stress strongly the notion of this as a reading comprehension test as such, or 
would you prefer to see it as a test of other, more productive, skills? 

Child: I'm not exactly sure what it does do, to be perfectly frank. Obviously 
I'd like to see it cover more than reading comprehension, and there is 
certainly a language production element involved here. Things have to be 
supplied that aren't there, and these things that have to be supplied are more or 
less trivial, depending on the difficulty of the text. I suppose with a great 
many items deleted, you have a problem of much greater magnitude than 
ordinary reading comprehension would suggest. So it may well go beyond 
reading comprehension, but I think the latter is subsumed in the exercise. 
Bliss: I would like to know how you select the passages, how you match them 
with the level of proficiency you want to test, and how you make the selection 
of the items you'll be deleting? 

Bondaruk: The selection of the passage is done on negative criteria. One of 
the things I try to avoid is getting a passage that deals with special subject 
matter that would require other than linguistic knowledge. At that point it's 
simply intuitive. Whether or not the test is operating on the correct level has 
to be determined empirically afterwards. We simply have to try out several 
forms of the test until we hit the right level. As far as what items we delete, 
I've already said we started out with almost everything deleted, and then 
we've sort of gradually been working our way back. There was a conscious 
decision in some of the tests to delete members of closed systems, the articles, 
determiners, auxiliary verbs, that sort of thing. In the other form of the test 
there was a decision to delete basically content words. This is still 
experimental, though. 






Some Theoretical Problems and Practical 
Solutions in Proficiency Test Validity 

Calvin R. Petersen and Francis A. Cartier* 



The Defense Language Proficiency Tests, commonly referred to as 
DLPTs, are tests designed to measure the reading and listening 
proficiency of all Defense Language Institute (DLI) graduates, as well as 
other individuals connected with the Department of Defense (DOD) 
who claim proficiency in a particular foreign language. The DLPTs 
therefore serve two major purposes: (1) they are used to measure the 
reading and listening comprehension skills of DLI students upon 
graduation or upon entry into advanced training and (2) they are used 
elsewhere in the Army, Navy, Air Force and Marine Corps to evaluate 
a serviceman's ability to meet the linguistic requirements for a 
particular military job, either here or overseas, and to indicate his 
capability on DOD records. 

The history of military language proficiency testing goes back to 
1948 and what were then called the Army Language Proficiency Tests. 
Tests in 31 languages were developed and used between 1948 and 
1953. Because of military pressures at the time, the development of 
those tests involved only a minimum of research. During and 
following the Korean War, it became apparent that the tests were not always 
useful in discriminating between individuals of different levels of 
language ability. As a result, the development of new Army Language 
Proficiency Tests was directed and accomplished by the Army 
Personnel Research Office. Accordingly, in 1954 the Adjutant General's 
Office began the introduction of new tests in approximately 40 
languages. We now refer to these as the DLPT I series. In subsequent 
years, it became apparent that while they were an improvement 
over the earlier ALPTs, the DLPT Is were not as valid or as reliable as 
had been assumed. Furthermore, there was only a single test form for 
each of these tests, and their use over more than a decade had 
reduced their effectiveness due to test compromise. Recognizing these 
deficiencies, in 1966 DLI initiated a series of new projects aimed at 
developing the DLPT II series to replace the DLPT Is. The first group 
of DLPT IIs was developed under contract with the Educational 
Testing Service (ETS), using language experts from DLI and various 
universities, in addition to ETS resources. Seven high-enrollment 
languages were included. Since that time another 14 languages have been 
added to the list. With a few exceptions, all are four-choice, 
machine-scorable tests with 60 items on reading comprehension and 60 items 
on listening comprehension. 

*The opinions expressed in this paper are the authors' and do not necessarily represent 
the policies of the Department of Defense, Department of the Army, or of the Defense 
Language Institute (DLI). 

In 1970 the Systems Development Agency (SDA) of DLI was created 
for the purpose of centralizing course and test development. From its 
inception one of the major problems encountered by SDA has been 
the adaptation of DLPT IIs to the military testing system in general. 
This has required us to reexamine the basic problems of language 
proficiency test validation. 

According to present Army procedure, scores on the DLPT are used 
to assign proficiency levels in listening and in reading. There are six 
levels defined in the military regulation. Each level has a code and a 
name, accompanied by a definition/explanation. For example, level 
R-3 is labeled "Minimum Professional Reading Comprehension" and 
carries with it a number of functional descriptions such as "Able to 
read standard newspaper items, . . . correspondence, . . . and technical 
material in his special field. Can grasp the essentials . . . without using 
a dictionary; for accurate understanding moderate use of a dictionary 
is required. . . ." 

There is a DLPT I cut-off score for each proficiency level. For 
example, a raw score of 40 on the reading test of the DLPT I would be 
converted to, and reported as, R-3. Scores of 26 to 39 correspond to R-2, 
and so on. Thus, scores on a multiple-choice reading and listening test 
are used as predictors of proficiency defined in terms of "real-life" 
language behavior in a large number of situations external to the test. 
When SDA was assigned responsibility for the DLPT II series, this 
routine of test interpretation came increasingly under question. The 
undeniable desirability of maintaining the administratively 
convenient procedure conflicted with current practice in test development 
and validation. 
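
The cut-off conversion just described can be illustrated with a minimal sketch. It assumes only the two DLPT I reading cut-offs quoted in this paper (26 through 39 for R-2, 40 as the minimum for R-3); the cut-offs for the other levels are not given here and are deliberately left out, and the names KNOWN_CUTOFFS and reading_level are hypothetical, not taken from any regulation.

```python
from bisect import bisect_right

# Only these two DLPT I reading cut-offs are stated in the paper. Cut-offs
# for the remaining levels are not given, so they are omitted; a score of
# 40 or more is therefore reported here simply as "R-3".
KNOWN_CUTOFFS = [(26, "R-2"), (40, "R-3")]


def reading_level(raw_score):
    """Return the level code for a raw score, or None below the lowest known cut-off."""
    minimums = [minimum for minimum, _ in KNOWN_CUTOFFS]
    index = bisect_right(minimums, raw_score)
    return KNOWN_CUTOFFS[index - 1][1] if index else None


if __name__ == "__main__":
    for score in (25, 26, 39, 40):
        print(score, reading_level(score))
```

Raw scores of 26 and 39 receive the same code under such a table, although they lie thirteen points apart on the raw scale.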

The rationale for development and utilization of educational tests, 
especially tests which are the basis of important assignment 
decisions, should certainly be grounded on standards of acceptable testing 
practice. The accepted standards of the professional testing 
community in the United States are outlined in a booklet entitled Standards 
for Development and Use of Educational and Psychological Tests, 
prepared by the American Psychological Association and the 
American Educational Research Association. The standards are currently in 
the process of revision by these two organizations, but the basic 
principles of test validation are well established and are largely 
unchanged in the draft revision we have seen. 

Three types of test validity are described in the Standards: 
criterion-related validity, content validity, and construct validity. 
Because there are some terminological problems in this field at the 
moment, let us briefly clarify the usage that we will use here. 

• Criterion-related validity is estimated by comparing test scores, 
or predictions made from them, with an external variable (the 
criterion) which is considered to provide a direct measure of the 
characteristic or behavior in question. (Note: The concept currently 
referred to as criterion-related validity is relevant to, but is a separate 
concept from, "criterion-referenced tests.") 

• Content validity is evaluated by determining how well the content 
of the test samples the class of situations or the subject matter about 
which conclusions are to be drawn. 

• Construct validity is evaluated by investigating what 
psychological qualities a test measures, i.e., by determining the degree to 
which certain explanatory concepts ("constructs") account for 
performance on the test. This is the approach used, for example, in 
devising tests of the construct "intelligence" or the construct "anxiety." 

The definitions given above are over-simplified, but it should be 
apparent that each approach to test validity is associated with 
different underlying strategies and with different methods of test develop- 
ment and use. It is therefore essential to establish the relationship 
between the purpose of a test and the proper procedures for estab- 
lishing its validity. 

In the case of the DLPT. the use of the test as a predictor of real- 
life language performance implies criterion-related validity. How- 
ever, except for the Russian and Chinese prototypes, the criterion- 
related validity of the DLPTs has never been established. What DLPT 
scores do provide is relative standing on a test which is essentially 
a sample of the content domain "language proficiency." But while it 
is probably logical to assume that the DLPTs sample the relevant 
content domain, the conversion of DLPT scores to skill levels is not 
warranted on the basis of content validity alone; to do that, we would 
have to actually establish the criterion-related validity of the DLPT 
by correlating them with external criteria. To perform such a 
conversion, relevant real-life performance criteria must be collected in 
quantitative form and the DLPT scores correlated with them to 
provide predictive validity coefficients and corresponding errors of 
estimate. The next best thing would be to correlate them with 
interview ratings, but even this procedure leaves the test uncorrelated with 
the true criteria which the interviews are meant to predict and, 
furthermore, entails some other problems we will discuss later in 
another context. 






The major question that then arises with regard to the predictive 
validity of the DLPT is what the criterion measure ought to be. For a 
complex concept such as "language proficiency," a large number of 
behaviors are implied, and no single criterion measure can be 
regarded as adequate to establish the predictive validity of the test. 
What we are dealing with is a broad, hypothetical construct for which 
no single criterion measure is adequate. The concept "language 
proficiency" therefore appears to have much in common with such 
constructs as "intelligence" or "anxiety," insofar as the required 
procedure for test validation is concerned. In short, in the absence of data 
from an external criterion, the conversion of DLPT scores to pro- 
ficiency or skill levels, as currently defined, cannot be justified in 
accordance with the APA Standards without construct validation. 

An adequate discussion of construct validity is not possible here, 
but the important point with regard to the DLPTs is that the collection 
of multiple criteria and the completion of the associated data-analysis 
for establishing the validity of the construct "proficiency level" in 
even one language could be an impracticably large task. In any case, 
the psychometrics of both criterion-related validity and construct 
validity require larger sample sizes than are available in most 
languages taught at DLI. Thus, since it is not possible to perform the 
criterion-related type of validation that is a pre-requisite for conver- 
sion of DLPT scores to skill levels, and since construct validation 
presents enormous theoretical and practical problems, the most rea- 
sonable immediate approach to establishing the validity of DLPTs 
appears to be through content validation. 

Content validity has a number of advantages and some 
disadvantages with regard to language proficiency test development and use. 
For one thing, the burden of validation falls primarily on the disci- 
pline of linguistics rather than on statistics. The various "parts" of the 
domain of "language proficiency" must be defined and represented in 
appropriate proportions on the test. Psychometrics can assist in this 
procedure, but it is no longer the major controlling discipline. Thus, 
for example, the convenient reliability statistic, KR-20, is no longer 
a controlling measure of test quality, since it is merely a measure of 
internal reliability and is not related to the quality of the test as a 
sample of the content domain. If a test of a given length proves to be 
unreliable, it must either be lengthened or revised, with consideration 
given to defining more reliable subcomponents of the domain and 
perhaps providing separate scores. 
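
For readers unfamiliar with the statistic, KR-20 (Kuder-Richardson formula 20) can be sketched as follows. This is an illustrative computation only, not DLI's procedure: it assumes items scored 0 or 1, uses the population variance of the total scores (some texts use the sample variance instead), and the function name kr20 is hypothetical.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomously scored items.

    responses: one row per examinee, each row a list of 0/1 item scores.
    Assumes at least two items and non-zero variance in the total scores.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)  # population variance (an assumption)

    # Sum over items of p * q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for item in range(n_items):
        p = sum(row[item] for row in responses) / n_examinees
        sum_pq += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


if __name__ == "__main__":
    data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
    print(kr20(data))
```

The formula depends only on how the items covary within the test, which is why it says nothing about whether the items are a fair sample of the content domain.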

Item analysis statistics also take on a different meaning when tests 
are produced according to the principles of content validity. The 
concern is not with producing pure-factor tests or making inferences 
about the distribution of traits within the population, but with how 
well the items on the test represent the content domain. Item analysis 
therefore becomes less of a tool for constructing and editing tests 
than a method for evaluating them in operation. Also, interpretation 
of item analysis data becomes more of an "art form" than a standard 
statistical technique. For example, an item which fails to discriminate 
could perhaps be (1) ambiguously written, (2) not a fair sample of the 
content domain, (3) diagnostic of a weakness in the course, 
(4) indicative of a homogeneous range of talent on a particular skill, 
or (5) some combination of the above. In other words, so-called "bad" 
items (as indicated by item analysis data) should not simply be 
discarded and replaced by items which are capable of producing higher 
internal-consistency reliabilities, but should be used as clues leading 
to judgments regarding improvement of both testing and instruction 
and to better understanding of the student populations. Perhaps the 
major disadvantage of content validation from the military point of 
view is that test scores cannot be converted statistically to proficiency 
levels. (That, as was pointed out earlier, requires criterion-related 
validation.) Score interpretation should be norm-referenced, which 
means that a student's standing, relative to the normative sample, 
can be identified, and all students can be categorized in terms of 
relative skill, but we cannot say that a particular score represents, for 
example, "minimum professional reading comprehension." In 
essence, all that can be inferred from norm-referenced, content 
validated test scores is rank ordering of students on their ability to 
perform the tasks within the tested content domain. However, that is not 
a disadvantage for selection of personnel if the DOD's purposes in 
using the scores are to identify the best (or worst) military "linguists" 
for a particular assignment without "pegging" them to a particular 
skill level. In fact, the norm-referenced scale will allow much finer 
discrimination of DLI graduates instead of merely categorizing them 
as Level 2 or Level 3 as is called for by the present system. 
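
One common way to flag an item that "fails to discriminate" is the upper-lower discrimination index. The sketch below is offered only as an illustration under stated assumptions: the paper does not say which index was used, the 27 percent group size is a conventional choice rather than a figure from the text, and the function name discrimination_index is hypothetical.

```python
def discrimination_index(item_scores, total_scores, fraction=0.27):
    """Upper-lower discrimination index for one item.

    item_scores:  0/1 score on the item for each examinee.
    total_scores: total test score for each examinee, in the same order.
    fraction:     share of examinees in each extreme group (assumed 27%).
    Returns (proportion correct in upper group) - (proportion in lower group).
    """
    n = len(total_scores)
    group_size = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower = order[:group_size]
    upper = order[-group_size:]
    p_upper = sum(item_scores[i] for i in upper) / group_size
    p_lower = sum(item_scores[i] for i in lower) / group_size
    return p_upper - p_lower


if __name__ == "__main__":
    totals = [1, 2, 3, 4]
    print(discrimination_index([0, 0, 1, 1], totals, fraction=0.5))
    print(discrimination_index([1, 1, 1, 1], totals, fraction=0.5))
```

A value near zero only signals that the item deserves a look. Which of the five explanations listed above applies is a matter of judgment, not something the index itself can decide.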

In any case, even if suitable external validation criteria could be 
found to legitimize the use of the DLPT as an indicator of skill levels, 
the usefulness of converting DLPT scores of DLI graduates to skill 
levels appears to be questionable. Because the proportion of students 
who achieve as high as Level 4 (as now defined) even in an advanced 
course will always be extremely small, and those that do not achieve 
as high as Level 2 are not likely to complete basic course 
instruction, virtually all students will be either Level 2 or Level 3 at the time 
of basic course completion; thus, the information derived from a 
validated DLPT on such a population will be essentially dichotomous. 
In other words, the end result of an extensive assessment effort is 
merely a two-category scale. Since there is every reason to believe 
that language proficiency is a normally distributed, continuous 
variable within the basic course population, the use of level descriptions 
simply reduces the number of possible discriminations which can 
be made. 

As was pointed out earlier, all DLPT I reading scores from 26 
through 39 are merely reduced to the code R-2, and the minimum 
criterion for R-3 is a score of 40. 

Now, if that procedure is examined closely, it will become 
apparent that, as traditionally used and interpreted, the DLPT might best be 
regarded as what people have recently been calling a 
criterion-referenced test. (More specifically, it has been regarded and used as a 
criterion-scored test, which is one form of CRT.) In other words, a 
particular skill is derived from an absolute or criterion score on the 
test rather than being a function of relative skill (rank) within a 
defined population. This has created the problems referred to earlier 
regarding the introduction of the DLPT II series. To simply plug them 
into the scoring system in the existing regulations was inappropriate 
because they were not parallel forms to the DLPT Is. The means and 
variances were different, and correlations between forms could not 
be obtained. There may be some question as to why these factors are 
of importance because, in both series, the indices were derived from 
norm-referenced comparisons. However, differences in norms or low 
correlation between forms is evidence that the same group of students 
will show considerable variance in their proficiency level depending 
upon which form of the test they take. This is expected to be a 
problem each time a new DLPT form is introduced. Also, the 
state-of-the-art in criterion-referenced test development is such that questions of 
how to construct (or even what constitutes) a parallel form of a 
criterion-referenced test are not yet fully resolved. 

For these reasons, SDA has recently recommended that DLI adopt 
a content validation model for future proficiency test development 
and interpretation. This recommendation is now under consideration. 
If it is approved, we will be released from the dilemmas presented by 
the criterion-referenced model and will instead be required to begin 
developing tests based on the best present collective judgment of the 
nature of the content domain "language proficiency." Future 
improvement of the validity of the testing system will entail efforts toward 
improved delineation of that domain. Despite this special problem, 
the content domain model has appeal in a number of respects, 
especially since we believe there is a practical, immediate solution to the 
validity problem, which we will consider later. 

It is appealing, to begin with, because the content-sampling model is 
defensible as applied to a teacher-made classroom test as well as to 
more complex test designs. It does not require large sample sizes nor 
extensive periods of proof administration. It is flexible in meeting 
changing management situations. When tests are compromised or are 
otherwise in need of replacement, no complex statistical equating of 
a new form back to a fixed cut-off score is required. That is not to say 
that it is not desirable to have statistically parallel tests, but the 
requirements of sample size, proof-administration time, and data 
analysis make statistical equating at least difficult and sometimes 
impossible at DLI. Also the present procedures and circumstances have 
resulted in the confusing requirement of equating a new test to an 
old one that is known to be compromised but to an unknown degree. 
It therefore appears to be much more rational to develop alternate or 
new test forms which are linguistically parallel and score them on a 
standard scale (T-scores). Obviously, the usefulness of this approach 
depends upon finding an acceptable solution to the problem of 
establishing content validity. 
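
A minimal sketch of scoring on a T-score scale follows, assuming the conventional definition (mean 50 and standard deviation 10 in the normative sample). Treating the scores passed in as the whole normative sample, and using its population standard deviation, are assumptions of this illustration; the function name t_scores is hypothetical.

```python
from statistics import mean, pstdev


def t_scores(raw_scores):
    """Convert raw scores to T-scores (mean 50, standard deviation 10).

    The list itself is treated as the normative sample (an assumption);
    the raw scores must not all be identical.
    """
    m = mean(raw_scores)
    sd = pstdev(raw_scores)
    return [50 + 10 * (x - m) / sd for x in raw_scores]


if __name__ == "__main__":
    print(t_scores([10, 20]))
```

Because each form is scaled against its own normative sample, a new form does not have to be equated back to a fixed cut-off on the form it replaces.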

One of the fundamental problems of language training and testing 
is that the training must prepare the student to understand statements 
he will never have heard or read before and to produce responses that 
are uniquely appropriate for them. The hypothetically ideal 
proficiency test would therefore be one in which the student would 
encounter all possible future stimuli and produce all the appropriate 
responses. This may be possible in training a student to copy Morse 
Code, which is a relatively small content domain, but is a total 
impossibility in language training. Instead, like most tests, a language 
proficiency test can only sample the relevant stimuli and responses. The 
first two problems in establishing content validity of a test are 
therefore to determine what to sample, and how to sample it. 

At first glance, it might appear that, in principle, a test of general 
proficiency in a foreign language should be a sample of the entire 
language at large. In practice, obviously, this is neither necessary nor 
desirable. The average native speaker gets along quite well knowing 
only a limited sample of the language at large, so our course and test 
really only need to sample that sample. (Further discussion on what to 
sample will follow after touching on the problem of how to sample.) 

The choice of a sampling method is a mini-max problem: What we 
want is the smallest possible sample that adequately represents the 
language. In practice, of course, we usually find that sample size is 
not freely variable (a test can be only a few hours long at best), and 
the problem is primarily one of assuring that a rational sample is 
achieved within that limitation. 

In many other kinds of inquiries, a random sample is the only
rational sample, but that is certainly not true in sampling a language.
We know, for example, that the rank-frequency of word occurrence
gives great importance to a few words and relatively little importance
to an enormous number of them. A random sample of the lexicon
would not be a rational sample. Nor would we be satisfied with
a random sample of the phonemes or graphemes of a language. In
Vietnamese reading tests, for example, we need not test especially
for recognition of the letter m, although we might want to be sure we
test for recognition of all the tone marks and diacritics.

Most language tests, including DLI's tests, therefore make a kind
of stratified random sample, assuring by plan that some items test
grammatical features, some test phonological features, some test
vocabulary, and so on. Thus, for example, the DLI's English Comprehension
Level tests are constructed according to a fairly complex
sampling matrix which requires that specific percentages of the total
number of 120 items be devoted to vocabulary, sound discrimination,
grammar, idioms, listening comprehension, reading comprehension,
and so on. Several years ago, DLI's SDA tried to determine the feasibility
of establishing a universal item-selection matrix of this sort for
all languages, or perhaps for all languages of a family, so that the
problem of making a stratified sample for test construction purposes
could be reduced to a somewhat standard procedure. However, such a
matrix has not, as yet, been found, and until it is, we must use some
method for establishing a rational sample of a language in our tests.
Therefore, we are pursuing a different line of thought, which returns
us to the topic of what to sample.
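
The kind of item-selection matrix described above can be sketched as follows. The category names and the total of 120 items come from the text, but the percentages, the item pools, and the function names are hypothetical, since the actual ECL matrix is not given here. The sketch allocates items to each category in proportion to its weight and then draws that many items at random from the category's pool.

```python
import random

def allocate_items(total_items, weights):
    """Split total_items across strata in proportion to weights,
    using largest-remainder rounding so the counts sum exactly."""
    total_weight = sum(weights.values())
    exact = {k: total_items * w / total_weight for k, w in weights.items()}
    counts = {k: int(v) for k, v in exact.items()}
    leftover = total_items - sum(counts.values())
    # Give any remaining items to the strata with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_fraction[:leftover]:
        counts[k] += 1
    return counts

def draw_test(item_pools, counts, seed=None):
    """Randomly draw the allocated number of items from each stratum's pool."""
    rng = random.Random(seed)
    return {k: rng.sample(item_pools[k], n) for k, n in counts.items()}

# Hypothetical percentages (assumed for illustration only).
weights = {
    "vocabulary": 25,
    "sound discrimination": 10,
    "grammar": 25,
    "idioms": 10,
    "listening comprehension": 15,
    "reading comprehension": 15,
}
counts = allocate_items(120, weights)

# Hypothetical pools of 200 pretested items per category.
pools = {k: [f"{k} item {i}" for i in range(200)] for k in weights}
test_form = draw_test(pools, counts, seed=1)
print(counts)
```

Drawing a second form with a different seed gives a form built to the same specification, which is one simple sense in which two forms can be called parallel by construction.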

It is a relatively simple matter to construct an achievement test of
high content validity. This is because an achievement test is designed
to sample a specific course or part of a course. In this case, the entire
domain to be sampled is readily available in the text, the tapes, and
the classroom activities. Furthermore, experienced instructors can
fairly easily identify those aspects of the course that are so easy for
students to grasp that they need not be tested, and those aspects that
will give reliable indicators as to the relative achievement of the various
students in the course.

If it can be established convincingly that, for all practical purposes,
the course is a rational sample of the language at large (or
even of an average native speaker's sample of the language at large),
then a rational sample of the course would serve as a valid item-specification
matrix for the construction of a general proficiency test.
Obviously, the validity of this argument rests on the validity of the
assumption that a DLI Basic Course represents a rational sample of
the language. This assumption is probably tenable with regard to the
grammatical structure of the language. It is difficult to believe that an
intensive course of 30 hours a week for 47 weeks (or even 24 weeks)
does not adequately sample the grammar of the language.

But is the assumption tenable for vocabulary and such other aspects
as idioms? As was mentioned earlier, in every language a few
words occur with very high frequency, a somewhat greater number
occur with lesser frequency, and a vast number occur with very low
frequency. Most courses take this fact into account when establishing
the vocabulary objectives for learning, although it is rarely the
only factor considered. One other factor is the importance of a word
to the student. In some instances, convenience in writing the dialogs
and narratives that appear in the text undoubtedly accounts for the
inclusion of a few words. But if relative frequency and importance to
the student are rational approaches to sampling the language, then
for all practical purposes the argument holds for vocabulary as well.
Furthermore, there is research to support the idea (first explored in
depth by George Zipf) that the highest frequency words are the words
of highest utility and are also the words that come most readily to the
mind of the native speaker when given the appropriate stimuli for
association. Therefore, in theory, if the course writer is a native
speaker, he will almost necessarily put into his course those words
of greatest frequency and utility from his particular sample of the
language at large. It is not difficult to hypothesize, then, that the same
will be true of idioms. If such an assumption is valid, then it can also
be assumed that any language course of truly substantial length (such
as an intensive course of 24 weeks) will contain, for all practical purposes,
a rational sample of the idioms. At present these assumptions
have not been proven. However, we believe they are at least as valid
as the assumptions we would have to make in order to argue for criterion-related
or construct validity of proficiency tests. In summation,
then, validating a test of language ability by establishing its predictive
validity is extremely difficult, if only because the possible criterion
behaviors are theoretically infinite, will be different for each individual,
and cannot be known in advance. Construct validity is similarly
difficult and, like predictive validity, requires large numbers
of students for field testing.
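
The rank-frequency argument can be made concrete with a small sketch that ranks the word types of a text by frequency and reports how much of the running text the top-ranked types account for. The function name and the sample sentence are invented for illustration. A real check of a course's vocabulary would need a large corpus, and nothing here reproduces Zipf's own data.

```python
from collections import Counter

def coverage_by_rank(tokens):
    """Rank word types by frequency and return, for each rank, the word,
    its frequency, and the cumulative share of running words covered."""
    counts = Counter(tokens)
    total = sum(counts.values())
    table, covered = [], 0
    for word, freq in counts.most_common():
        covered += freq
        table.append((word, freq, covered / total))
    return table

# Hypothetical miniature "corpus"; a real study would use course texts.
text = ("the course is a sample of the language and "
        "the test is a sample of the course")
table = coverage_by_rank(text.lower().split())

for rank, (word, freq, share) in enumerate(table, start=1):
    print(f"{rank:2d} {word:10s} {freq:2d} {share:6.1%}")
```

Even in so small a sample, a handful of word types accounts for most of the running words, which is the pattern the argument relies on when it treats frequency as a rational basis for selecting vocabulary.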

Content validity, then, appears to us to be the only feasible approach
at this time. This approach also presents some theoretical and
practical problems, since the item specifications for a proficiency test
should constitute a rational sample of the language. But for our purposes
it seems defensible to consider the present DLI courses as
rational samples of the language and to sample them, in turn, for the
item objectives of our tests of general ability. The resulting tests
would therefore serve quite satisfactorily as proficiency tests.

Perhaps the most important advantages in adopting content validation
of DLI "proficiency" tests are that it should greatly simplify the
entire testing system while providing more useful information to decision
makers. It should also more clearly separate those test development
activities which are intended to meet ongoing personnel decision
needs from those which are intended to advance the state-of-the-art.

While the content domain model and content validation appear to
be the only practicable approaches to our testing problems at this
time, a substantial effort must be devoted to research into the domain
and into the search for external (real-life) criteria employing the
criterion-related approach. Such research, of course, is also relevant
to the problem of objective-setting and to learning-system development
and evaluation. It is therefore necessary if we are to continue
to improve the correspondence between our training and evaluation
systems and the needs of the user agencies who provide us with a
reason for existence.

DISCUSSION 

Spolsky: How would you distinguish a proficiency test with content validity
based on the syllabus from an achievement test?

Cartier: I wouldn't necessarily. You have to make a whole raft of assumptions
about whether the course itself is some kind of rational sample of the
language at large, or at least the language that you're trying to teach (to
military people, in my case). It seemed to us to be a practical solution to a
very difficult theoretical problem to do that. In other words, I would use an
achievement test based on a satisfactory syllabus for the same purposes that
I would use a proficiency test, if I had one that was criterion-validated.

Hindmarsh: My particular interest in relation to tests is in the establishment
of language syllabus specifications, and I see this in a broad context of not
merely the language elements, but also the sociolinguistic and psycholinguistic
parameters that relate to such a specification. In your description of content
validity you refer to both the subject matter and the class of situations.
I'd like to know how far you have related elements in content validity to each
other, and in what proportions. As I see it, you have to handle not only the
syntactic and lexical items, but also the operations that are done with these
items, the operative skills. Those skills take place in a context, and as soon as
you go up to the context level you're in the sociolinguistic/psycholinguistic
domain. I'm wondering how far you have approached it from the language
presentation, from the intentionality of the speaker?

Cartier: I've approached it just as far as the paper has gone at this point.

Clark: Is the primary purpose of testing in the DLI context to rank people on
some language ability or achievement?

Cartier: One of the purposes is to rank them at the school. The other reason
for testing at all is to try to give some indication to the military unit as to
which of them are capable of doing the job. They want us to tell them that if
Jones is a level-3 man, he can do the kinds of things which are represented
by the phrases in the description of S-3, R-3 and so forth. With interviews I
think we can approximate that. Of course, as I said, even the interviews have
never been validated against a real criterion. With the paper and pencil and
tape recorded tests, the Defense Language Proficiency Tests, we can reliably
rank them, put them in a T-scale. Since we have no way of getting access to
the criterion population to get a metric from which to make a criterion-related
validation of the DLPT, we are unable to say that a particular score on the
DLPT represents level 2 or level 3. So we provide the information we can to
the user for his purposes in deciding where to send Sergeant Jones and where
to send Sergeant Smith.

Clark: The reason that I asked that question is that the achievement versus
proficiency testing nomenclature might be a red herring in the sense that you
really don't care whether it's an achievement test in terms of content and
syllabus or a proficiency test in a sense of being able to do something for
real-life purposes, because in either event, regardless of the name of the test,
the ultimate validation would be against some as yet unavailable criterion.

Cartier: That would be true except for a complication that I only mentioned
very briefly in the paper, and that is that the DLI courses are not the sole
source of military linguists (by linguist in the military we mean a man who
speaks a foreign language). They come from other sources too. A man grew up
in a family where his mother spoke Serbo-Croatian, so he learned Serbo-Croatian.
When he comes into the Army he claims to be highly proficient in
Serbo-Croatian. We need to have some way of finding out whether, in
fact, that's true. So we have to use some kind of test to do that. Obviously the
Air Force or the Army would like to have the statement we make about that
man be comparable to the statement we make about the graduate.

Clark: So it would be a proficiency test for the people coming in from the
outside with background knowledge, and an achievement test for those people
who went through the course.

Cartier: In effect it would be, and this is a point discussed at some length in
the paper. The validity of this point is one of the things that frankly we were
hoping to get some ideas from you people about.

Wilds: I'm wondering if you're going to be able to extricate yourself from
S and R ratings. I'm not clear if people want to know what those ratings are
for your graduates, how you are going to supply them, or if you're not
going to supply them, how you're going to talk people out of wanting them.

Cartier: As I said in the paper, these matters are under consideration by
Headquarters DLI at the present time. I think it's premature for me to say
what that decision would eventually be. We would like to be able to satisfy
everybody, and maybe we'll figure out some kind of system for doing that.

Spolsky: I'd like to come back to the problem of relating the tests so closely
to the syllabus. Once you're successful, you'll never be able to arrive at any
satisfactory judgment of how to change either. Having a test that is independent
of the syllabus will give you a chance of complaining that the test is not
doing well for your students, and therefore you worry about the test; or that
the syllabus is not doing well for your students, and then you worry about
the syllabus. In other words, as soon as the two of them are based on exactly
the same analysis, unless you've discovered the magic principle underlying
the structure of language and how to teach it, then this kind of decision is
likely to block you from getting anywhere. I think that one of the things suggested
earlier was the possibility of using a test based on a syllabus, but based
on someone else's syllabus, or a test based on an earlier syllabus, or a test
based on a new syllabus you're thinking of having. But as soon as the test and
the syllabus are based on exactly the same analysis, the best you can expect is
that your students will do better on the test than students who come in from
anywhere else. I think this sort of practical question, the effect on the possibility
of future development of locking the two things together, is one that
would worry me very much. That is why I would argue very much for a proficiency
test which itself is based on some different kind of analysis. The advantage
of taking an integrative approach is that it is not based specifically on
any kind of analysis, and therefore remains fairly independent.

Cartier: We're concerned about this too. We very seriously considered the
possibility of going to other syllabuses. The difficulty there is that I think we
have the only 24-week Haitian Creole course in the world, for example. There
are FSI courses for many of the languages that we teach, but the question
comes up as to whether the FSI course is a rational sample for military people.
An additional rationalization for the procedure that we're suggesting is
that the course that we teach at DLI is more or less targeted in on the language
problems of military personnel. The test is then more valid for making
the personnel selections than the language sample in Serbo-Croatian that a
man got because he learned from his mother or someplace else. So I think I
can rationalize that if indeed our course represents the language problems of
the military man more than other ways of learning the language do, then this
procedure is not all that bad.

Spolsky: The more sure you are of the validity of your course analysis, the 
more willing you should be for your students to take tests that are unrelated 
to it. 

Cartier: There is no disagreement at all that we would like to have a general 
proficiency test for use in the Defense Language program. We want to initiate 
research toward the criterion, and hopefully do some more research into the 
content domain itself. 

Spolsky: But let's say you make your test and it works. You have a really
good test of the present form of your syllabus, and it tests beautifully how
well your students do with the material. Let's say suddenly you realize that
you would like to change the syllabus. How will you justify the fact that the
new syllabus produces more? You have a new test and all your students will
continue to do better on this test.

Cartier: The reason for changing the syllabus would be perhaps that it's been
five years since you've put it together, and a lot of the terminology has become
obsolete. Another reason for changing the course might be that the
Air Force wants us to place a little more emphasis on one particular language
skill than we've been doing in the past, and a little less on another. Another
reason you might change the course is that you find an improved methodology.
If you changed the content of the syllabus, it would be for some logical
reason, and that logical reason would be just as applicable to the testing
program as it is to the course design. Therefore, the new test that you would
have to write to represent the content of the new course would also be a
valid test even for the outsider who claimed to have learned his language
elsewhere, because you have changed that for operational reasons, the same
reasons you change the course for.

Spolsky: And you'll be able to continue to guarantee success because you'll
set the test that fits exactly what you're aiming at, and you'll prove that you
get it better than anyone else.

Cartier: I certainly hope so.

Oller: I think that maybe we could suggest another kind of validity here. I
don't know what one might call it, perhaps false validity. If your course is
really not teaching the language, but is teaching certain things on your course
syllabus, and if the test validity is related to how well the test measures
what's in the syllabus, then the test could be a valid test in terms of what's in
the syllabus, and still not be a measure of language proficiency. That's exactly
what happens in the discrete point philosophy of testing, when you get
learner grammars distorted to the point that second language learners score
higher on certain items on a test than native speakers do. You train them in
a way that is not really within the normal limits of the grammar of the language,
but you train them in a way that has to do with how you've defined
your syllabus in terms of some discrete point teaching philosophy. Then you
test them on the basis of a discrete point testing philosophy, and you discover
that they score even higher in some cases than native speakers do. That
would be a case, I think, of false validity. Why do you have to distinguish
between achievement tests and proficiency tests? People have said, if you
give a proficiency test or an achievement test that's really a proficiency test,
then people start teaching to the test, and with a discrete point test that can be
disastrous. In the case of integrative tests it doesn't seem to be a particular
problem. That is, it's very hard to improve scores on an integrative test unless
you teach the language. So it seems to me that we might do well to at least
challenge in our thinking the dichotomy, the dualism, between achievement
and proficiency testing, and to think about the possibility that proficiency
tests might be used as sort of course exit examinations, and might be used as a
basis for motivating what happens in the course.

Davies: It seems to me that if one argues that a diagnostic test is a kind of
non-achievement test, then in a way all achievement tests are essentially
diagnostic. What one really wants to know is what people are not doing. If
one is going to do anything about it, this is really what the feedback is supposed
to be for. That being so, it seems to me that the value of a proficiency
test in the kind of setup we heard about from Dr. Cartier is that it could be a
means of validating one's syllabus. This seems to me to be particularly the
value of having a test alongside the syllabus that one is using at the moment.
Otherwise the syllabus is there because one thinks it ought to be there. And
as we all know as language teachers, we often wonder really whether this is
the right way to do things. The value of having a proficiency test alongside it
is that it's one means of finding out.

Cartier: The only thing I could say to that is the same answer I gave to Bernie
Spolsky, and that is, we would dearly love to have what we've been pretending
to have, and that is a proficiency test validated against an external criterion,
or validated in some legitimate fashion. We don't have one at the
moment. The practical problem is simply that at the present time I have no
access to the criterion population. And so I am proposing something that will
keep us going in some kind of legitimate fashion until we can get some of the
research done into content and criteria and come up with a proficiency test
that we can stand behind.

Clark: You say that there are certain pragmatic and practical needs that you
have to face and resolve, and I'm very sympathetic with that. I would suggest,
and I think this might satisfy some of Bernard's criticisms or observations,
that one of the main concerns is that the same people who are doing the
course are also doing the tests. I should think that you could identify people
who are familiar with the military situation, but who are not directly associated
with DLI, especially not with the teaching part of it, to just take a
thorough look at the test and say in their opinion if the language is the kind
that the operational program requires. I think that might be of some practical
help in the validation process.

Cartier: Indeed it would. We are beginning to use what we call TLA's, or
technical language advisors. These are military people who have learned the
language, and have gone out and used it for a number of years. They're not
native speakers, but at least they know the work context. We're getting advice
from these people regularly.

Petersen: We've had different linguists produce supposedly parallel proficiency
tests. We administer them to the same population, and very often we
find out that they don't correlate highly. I think that this is one of the problems
with a definition of proficiency. It seems to depend on a particular item
writer or test constructor as to what it means.
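
The check Petersen describes, administering two supposedly parallel forms to the same people and correlating the scores, can be sketched with a plain Pearson correlation. The scores below are invented for illustration, and the function is a textbook implementation rather than any procedure reported by the participants.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no variance")
    return sxy / sqrt(sxx * syy)

# Hypothetical scores of the same five examinees on two "parallel" forms.
form_1 = [40, 55, 62, 70, 85]
form_2 = [45, 50, 70, 65, 90]

r = pearson_r(form_1, form_2)
print(f"correlation between forms: {r:.2f}")
```

A low coefficient for forms written to the same specification would suggest, as Petersen notes, that what each form measures depends heavily on the individual item writer.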





Two Tests of Speeded Reading^ 

Alan Davies 



In this paper I place question marks against two current topics in the 
language testing literature, those of communicative competence and 
criterion-referenced tests. What I have to say may be construed as a 
criticism of "integrative" or "global" tests. It is not intended to be so. 
The two tests I describe in this paper are integrative or global and 
while I do, in passing, query their value, I conclude by suggesting that 
global tests which do not pretend to be anything else can be a proper 
part of a proficiency battery, and that, indeed, they may serve as a 
means of resolving the dilemma of choice that seems to be at issue, 
that of either norm-referenced or criterion-referenced, either content 
or predictive validity. 

It is never difficult to relate developments in one branch of language
study and teaching to those in another. A recent development
in linguistics has been the rejection of formalism, of formal models. 
Grammar has moved into semantics so that the boundary, always 
faint, now seems non-existent. At the same time there has been a 
great increase in interest in all areas of macrolinguistics (Lyons 1968) 
and particularly since the slowing down of the 1960s' thrust in psy- 
cholinguistics (Bruner 1974) and in sociolinguistics. Hence the at- 
tempts not only by sociolinguists and ethnomethodologists but by 
microlinguists to look at discourse, i.e. to accept that the sentence is 
not the absolute upper limit of analysis. 

In language teaching there has been a similar move into non-discrete
and often mixed areas. Of course, which areas are now a
matter of dispute and uncertainty. Again there has been what is
regarded as a failure of formalism, the failure of the New Key and
the general structural approach and the seeming inability of applied
linguists to formulate how to handle the implications of generative
grammars. Parallel to the interest in discourse I have mentioned has
been the growing feeling that language teaching lacks situation, and
here I do not mean the simple deictic language teaching situation
through realia or pictures used in beginners' courses. Nor do I mean
the normal (and useful) provision in courses such as English for
special purposes of help with text cohesion and the various intersentential
devices. Instead I think here of the need of advanced learners
for an introduction to the rules of discourse, i.e. some help with the
ways in which discourse can use variation as a process and fashion
within-text meanings for its tokens (Widdowson 1974). Of all recent
developments in language teaching this link-up with discourse analysis
seems to me the most promising. It is the most concrete attempt to
formalize communicative competence for language teaching purposes.
However, it has serious difficulties, namely, that all attempts so far to
describe discourse start from a notional framework and end up in uncertainty.
Speech acts and speech functions are of interest as ideas to
teachers and learners, but they remain undescribed. And what is not
described cannot properly be tested.

In language testing the same move and the same rejection can be
observed. Already in the 1960s discrete point testing was being
queried (Davies 1968). Global and integrative tests have become
more attractive and are being justified under the aegis of communicative
competence. It is, of course, one thing to borrow some other
discipline's theory if it contains a usable formalism (e.g. psycholinguistics
borrowing from T.G.), but quite another to borrow some other
discipline's notions and then use them as notions once removed. Language
tests cannot both test the communicative competence hypothesis
and at the same time justify themselves by a theory of communicative
competence.

In all these developments a two-fold argument is implicit; note that
sometimes one part is made, sometimes the other, rarely both:

(1) any analysis is false to the truth of the language, especially any
formal analysis, because it cannot get everything in, or because, simply,
formal analysis is wrong.

(2) this particular analysis is false because it ignores the necessary
data that has been idealised away; therefore it ignores not just what
is peripheral but what is central to meaning and to language, e.g.
context, variation. Typically, the grammar, the structural and transformational
drill, the discrete point test item have accepted the need
for idealisation, i.e. they have been selected as exemplifying features
of linguistic competence. The existence of a gap between them and
the behaviour they are intended to represent has always been admitted;
indeed, even Lyons' (1972) discussion of idealisation accepts
the existence of and argues the need for such a gap in linguistics.

The move away referred to above is not just the consequence of
fashion, nor does it reflect a distaste for formal analysis. It is rather
a consequence of the failure, real or not, of these models, namely the
linguistic one does not satisfy the canons of scientific respectability,
the teaching one does not succeed, students and teachers get bored
and learning stops, the testing one is abandoned because of knock-on
effects from both the others and because it claims to have (and indeed
may have) more concern about its validity than either of the others.

Formal models may have been abandoned, where they have been,
for the right reasons; but it is difficult to see, if we now concentrate
on the testing field, what is to take their place. Presumably communicative
competence: the difficulty here is how to work idealisation in
reverse, i.e. to hold constant for the linguistic parameter, just as the
linguist holds constant (by standardisation, by decontextualisation)
for the sociolinguistic parameters. Or is it to be the global test? It is
surely significant that along with the development I have indicated
has gone an awakened interest in cloze procedure and the dictation
technique, both global in their approach.

I want now to turn away from movements and look at some examples
of tests. About 10 years ago I constructed an English Proficiency
Test Battery (EPTB, Davies 1967) on behalf of the British Council who
have since that time made use of the Battery in a number of countries
as a means of assessing the English proficiency of students applying
to the Council for scholarships, etc. in order to study in the United
Kingdom. Most of the students tested were until fairly recently postgraduates,
but latterly the test has also been used to select people
applying for technical assistance awards, many of whom will be
attached for their training to institutions and organisations other than
universities. For a considerable period the Battery existed in only two
versions (A and B), but last year Alan Moller, a Council officer, and
I worked together to produce a C version. The reworking and rewriting
led me to consider afresh the structure of the Battery and
recall the original design.

The rationale I adopted in 1964 was twofold: the Battery should
have a linguistic base and a work sample base. This led eventually,
after elimination of subtests, to a four-part Battery: (1) Phonemic
Discrimination; (2) Stress and Intonation; (3) Reading Comprehension;
(4) Grammar. Numbers 1 and 2 were on tape; 3 and 4 were
written. There was also a fifth test, (5) Reading Speed, which has
been used as an optional extra. It is of Tests 3 & 5 that I want to speak:
these two represent what remained of my work sample selection.
Tests 1, 2 and 4 represented the linguistic sampling and were, I suppose,
discrete point tests, inasmuch as it seemed clear what was being
tested in each item, a phoneme contrast in 1 or a modal contrast in 4.
It was, admittedly, less clear in Test 2; and I sometimes wondered if
Stress & Intonation did not belong more properly to the work sample
selection. Stress and Intonation are notoriously difficult to pin down
as discrete markers of contrast. But at the time it did not seem to
matter too much which side of the fence Test 2 belonged on since it
seemed happily settled on top of the fence. What did matter was that
Stress and Intonation were part of the language, areas of linguistic
investigation, and they appeared to pose problems to advanced
learners of English. So the work sample tests proper remained: Test 3,
Reading Comprehension, and Test 5, the optional test of Reading
Speed.

Sampling of the language is, of course, the chief burden placed on
the proficiency tester; it becomes also his main strategy since it determines,
other things being equal, what he tests. If it is a problem for
discrete point tests (where to some extent the sheer accumulation of
items lessens the weight of decision on the tester), how much more so
for the work sample test where practicalities such as time make accumulation
of items impossible and the tester's decision final. Now it
so happens that in any test very few texts can be employed as exemplars
of critical work samples. Of course, as we all know, there are
ways around this dilemma. The first is through "ideal type" selection,
a kind of content validity, in which the tester recognises a particular
text as being exactly what he wants, representative of all possible
texts for his population. Such an approach is, of course, guesswork,
but not uncommon. The other way out, usually employed in addition
to the first, is through correlation, either of the concurrent or of the
predictive kind. Here the tester discovers that his text sampling does
predict after all, a fortunate outcome and one he is glad to accept
since he is usually not also an experimenter who would be seeking
texts with better and better predictions. Work sampling, then, in the
choice of texts is a form of guesswork. The guesswork may be confounded
by yet more guesswork in the method employed to assess
comprehension of the text, whether it be written or spoken.

At this point illustrations of Tests 3 and 5 would be in order: 

TEST 3 

This is a Test of your understanding of written English. Here are 2 passages taken
from fairly recent books. In each passage a number of the words are shown only
by their initial letter and a dash. Complete these words to show that you
understand these passages.

Here is a short example:

T____ i____ a test o____ reading comprehension

If you read the whole sentence you will see that it makes some sort of sense but
that three of the words are incomplete. Try to complete them. Have you
succeeded? They are: This, is and of. Thus the complete sentence reads: This is a test
of reading comprehension.

Now go on to the two questions below. Work quickly.

Question 1

But changes i____ t____ home are less revolutionary, a____ easier t____
assimilate, t____ changes i____ industry. Technical progress h____ removed
only part o____ education f____ t____ home; long after either she o____ her
husband has ceased t____ be one f____ their son, t____ mother is f____ her
daughter a teacher w____ accustoms her t____ a particular way o____ doing
things i____ t____ home, a____ t____ daughter, s____ her ways are also her
mother's, is likely t____ feel t____ she can trust t____ good sense o____ her
helper.

Question 2

I____ is f____ t____ reason t____ India became t____ first area t____
encounter t____ problem o____ using English a____ t____ commercial,
educational a____ scientific medium i____ w____ are now called
'under-developed' countries: a problem w____ became acute b____ t____
middle o____ t____ twentieth century i____ many parts o____ t____ world.

TEST 5*

Our British policy for speok higher education is tenable only girl on certain
assumptions. The first did assumption is that the numbers yes of young people
selected each year shouldn't for the nation's needs weather. The third
assumption our is that we offer acceptable opportunities whiten for part-time further
education to grudge those who are not selected. None of these assumptions the
is justified. Our eighth methods of selection assume that our who intellectual
resources are limited old by genetic factors, and that when snake we select
candidates to monumental go to grammar imagine schools or to universities we are
drawing from the population reclining those with the innate ability to thighs
profit from these privileged for kinds of education. Of course, the I intellectual
resources sketch-book in a population are ultimately limited by its fifty genetic
make-up. But we have if abundant evidence that it is not ten genetics, but
inequalities in previously our society and inadequacies in our educational than
system, which at present limit as our investment in man, and didn't this is true
driftwood even in the most affluent nations. Mental there is now convincing
evidence that idea thousands of has children fall out of our handle educational
system each year prefer not owing to lack hard of ability but owing to ag lack
of motive, and incentive, and pelvis opportunity.



*The text of Test 5 reproduced here contains about 1/5 of the total test.

Test 3, as you can see, is a variety of cloze test, but it differs from
classical cloze in three ways: first, it is speeded, i.e. testees are
given 5 minutes to complete closure of the 49 items; second, the
initial letter of each item is given (and only the original writer's words
accepted); third, the items do not represent every nth word but a
random selection of function or grammatical words.
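
The item-construction rule just described can be sketched in code. The following is only an illustration under stated assumptions, not the procedure actually used to build Test 3: the function-word list, the blank marker, the seed, and the names `make_cloze` and `score_exact` are all hypothetical. It selects function words at random (rather than every nth word), leaves the initial letter visible, and accepts only the original word.

```python
import random

# Hypothetical function-word list for illustration; the paper does not give one.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "to", "is", "and", "that", "this",
    "for", "as", "by", "which", "who", "than", "or", "has", "since",
}
PUNCT = ".,;:!?\"'()"


def make_cloze(text, n_items, seed=0):
    """Blank out up to n_items randomly chosen function words.

    Each chosen word keeps its initial letter, followed by a blank.
    Returns the mutilated text and the list of original words (the key).
    """
    words = text.split()
    candidates = [
        i for i, w in enumerate(words)
        if w.strip(PUNCT).lower() in FUNCTION_WORDS
    ]
    rng = random.Random(seed)
    chosen = sorted(rng.sample(candidates, min(n_items, len(candidates))))
    mutilated = list(words)
    answers = []
    for i in chosen:
        core = words[i].strip(PUNCT)
        answers.append(core)
        # Keep surrounding punctuation, replace the word by "initial + blank".
        mutilated[i] = words[i].replace(core, core[0] + "____", 1)
    return " ".join(mutilated), answers


def score_exact(responses, answers):
    """Exact-word scoring: only the original writer's word earns a point."""
    return sum(
        r.strip().lower() == a.lower() for r, a in zip(responses, answers)
    )
```

The speeding itself (the 5-minute limit) is an administration condition and is not modelled here.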

If, as I have suggested, prediction is to be the touchstone for a
global test, then indeed Test 3 predicts, with considerable variability
in the size of correlation. However, satisfactory figures ranging from
.4 to .7 have been achieved for different populations, using either
end-of-course results or tutors' assessments of English as the criterion.
Again, it has satisfactory internal statistics: reliability (various)
ranging from .8 to .95, a mean of 27, and an s.d. of 13 (A version). Its
correlation with Test 5 for a variety of populations is always between
.5 and .7. It is a practical test which is grasped easily by the testees
and is very quick to administer.

What, however, is it testing? It is labelled Reading Comprehension,
but then any test that is not a spoken one is likely to be a reading test
of some kind. Test 3's r of .5, .6, .7 with Test 5 has been quoted.
What does Test 5 measure? As you can see, the technique here (after
the practice lead-in with the non-English words) is to interpose
English distractors randomly in a running text. Testees are asked to mark
(circle) these distractors while "reading" the text as fast as they can.
They are stopped at the end of 10 minutes, and the number of
distractors located is their raw score. There is a deduction for each
non-distractor marked (but not more than 4 deductions in any one line of
the original text). Again, the validity figures against the same criteria
have the same range (.4-.7) as Test 3. Its range of rs with Test 3 has
already been quoted. Its reliability has always been above .9. Its mean
is 70, and its s.d. is 33 (A version).
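
The scoring rule for Test 5 can be written out as a short sketch. Two points are assumptions made for illustration only: the size of each deduction (taken here as one point per wrongly marked word) is not stated in the text, and the printed lines passed in are treated as the "lines of the original text". The name `score_speed_test` and the data layout are hypothetical.

```python
MAX_DEDUCTIONS_PER_LINE = 4   # stated in the text
DEDUCTION_PER_ERROR = 1       # assumption: the size of a deduction is not given


def score_speed_test(lines, marked):
    """Score a distractor-marking test.

    lines  -- list of lines; each line is a list of (word, is_distractor)
    marked -- set of (line_index, word_index) positions the testee circled
    Returns distractors located minus capped deductions for false marks.
    """
    hits = 0
    deductions = 0
    for li, line in enumerate(lines):
        false_marks = 0
        for wi, (_word, is_distractor) in enumerate(line):
            if (li, wi) in marked:
                if is_distractor:
                    hits += 1
                else:
                    false_marks += 1
        # No more than 4 deductions are counted in any one line.
        deductions += min(false_marks, MAX_DEDUCTIONS_PER_LINE)
    return hits - deductions * DEDUCTION_PER_ERROR
```

The cap per line limits the penalty for a testee who marks words indiscriminately on a single line, while still making random circling unprofitable overall.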

Does Test 5 test reading speed? There are two other pieces of
evidence, one old, one new. The old one refers to the relation of both
Tests 3 and 5 to Test 4, my discrete point grammar test. This test has
47 items, is of the multiple-choice variety (3 choices), and is
traditional in format. Its validity rs have again a similar range, .4-.8, with the
same criteria as Tests 3 and 5. Its range of rs with Tests 3 and 5 is
.5-.7. We can say, therefore, that the mean r between Tests 3-4, 4-5,
and 3-5 is .6. Test 4's reliability is between .8-.9 (A version), its mean
is 33 and its s.d. is 8 (A version). The first bit of evidence I mentioned
is that, in the original Factor Analysis, Tests 3, 4 and 5 all loaded on
the 3rd Factor, which was labelled Reading Comprehension. But I do
not wish to press this point, since they also loaded on the 1st Factor,
along with other tests, and because one of the tests of listening
comprehension also loaded with them on the 3rd Factor (though, as usual,
it was possible to wriggle out of this embarrassment by pointing to the
literary nature of that particular listening text and to the amount of
reading comprehension involved in answering the multiple-choice
questions).

The second piece of evidence is recent. The new interest in cloze
procedure has not passed us by in Edinburgh, and this has led one of
our postgraduate students who is interested in the place of literacy
among secondary school students in Botswana to construct two cloze
tests, one in English and one in Setswana. It has also led us to ask
ourselves what a cloze test tests. Further, it led me to speculate as to why
I had bent the cloze technique, and I found I could not remember,
except that it seemed (and still does seem) practical. But in an
attempt to gain some impression of the effect of speeding on my Test 3,
I recently carried out a small experiment using both the A and C
versions of Test 3 in a crossover design. I gave the speeded version
first to a group of mainly African student teachers (N = 21). The result
was, along with a massive gain in mean score (well over one s.d.),
an r of just under .6. This is not surprising since it has been estimated
elsewhere (see Cronbach 1964) that nearly 40 percent of the variance
in a speeded reading test may be accounted for by speed alone. In my
case it could well be more. This is not surprising, but it is curious,
since I now have the paradoxical situation in which Test 3 (speeded
but labelled Reading Comprehension) correlates .6 with Test 5
(Reading Speed), but Test 3 (speeded and now labelled Reading Speed)
correlates .6 with itself (unspeeded and now labelled Reading
Comprehension).² It would seem, therefore, that we have not only different
kinds of reading comprehension but different kinds of reading speed,
too. Cronbach (1964) offers some comfort here: "Reading
development includes both speed of reading and comprehension, and a useful
test must consider both these elements. Most testers have tried to
measure the two aspects of performance independently, but they have
been largely unsuccessful." So Test 3 could be regarded as a
perfectly proper global test containing reading speed and reading
comprehension.

But if there is some doubt as to the influence of the speed factor in
Test 3, what of the comprehension? Opinions differ on cloze
procedure. Both Weaver and Kingston (1963) and Rankin (1957) have raised
experimental doubts, while Schlesinger (1968) has raised the more
theoretical question as to whether cloze can be more than a means of
assessing awareness of intersentential relations. Bormuth (1969),
Oller and Conrad (1971), and Oller (1972a) have reported results to the
contrary, with Bormuth showing very high (.95) correlation with
multiple-choice tests. Satisfactory results have also been reported for
L2 learners (see Bowen 1969; Oller 1972a and b). While cloze tests
do seem to test something related to language, i.e. some aspect of
reading, it seems equally clear that we do not know what they test.
As Carroll (1972) and Schlesinger (1968) ask, the latter directly, it's
time in reading research to measure not just "how much" has been
understood but also "how much of what."

Eventually we always come to the question of what tests are for. 
The purpose of a test, as opposed to, say, an exercise, is to provide 
a rank order. Hence, as I see it, the need for criterion-referenced tests 
to be at bottom, for some populations, norm-referenced. The dispute 
between norm- and criterion-referencing seems to me to be about 
samples and populations rather than about content and criteria. (Of 
course, it is both, but for any given test criterion-referenced for one 
sample, there is always a population available on whom the test could 
be norm-referenced.) 

But there is no need to force so severe a division. Since the assump- 
tion with an ability is that it is known differentially, this presumably 
means that learners know different bits. The chief way, then, of
determining what should be included in a test is content sampling, i.e. by
content validity. And this seems to me exactly what
criterion-referencing is on about. What is more, it seems to me exactly what discrete
point testing was on about, too, since the assumption was that the
language was describable into those units and those bits. It is, of
course, true that the proficiency tester makes up his syllabus as he
goes along so that, although there is no known syllabus for him to
sample as content, he does have his parallel, assumed syllabus. Global
tests, integrative tests, cloze, dictation, reading speed, and the like
work essentially on predictive validity. But predictive validity (or
concurrent validity since they are essentially the same) is a poor
substitute for content validity, since it puts all the onus of decision on
the criterion, and it is well known how unreliable (and often invalid)
criteria are. And, of course, grades for foreign students often sink to an
r of .2 with an English language predictor; tutors' assessments are of
more value, but an r of even .6 (which is by no means unsatisfactory)
is really very small when you remember how much of the variance,
all of which in this case is language, is unexplained.
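
The point about unexplained variance rests on the standard relation that the proportion of criterion variance accounted for by a single predictor is the square of the correlation. A minimal sketch of that arithmetic follows; the function names are illustrative and not from the paper.

```python
def variance_explained(r):
    """Proportion of criterion variance accounted for by a predictor with correlation r."""
    return r ** 2


def variance_unexplained(r):
    """Proportion of criterion variance left unaccounted for."""
    return 1 - r ** 2


for r in (0.2, 0.6):
    print(
        f"r = {r:.1f}: explained {variance_explained(r):.0%}, "
        f"unexplained {variance_unexplained(r):.0%}"
    )
```

On this reckoning an r of .6 accounts for about 36 percent of the variance and leaves about 64 percent unexplained, while an r of .2 accounts for only about 4 percent.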

This is not an attack on figures. Rather it says we must work to get
meaningful ones. Discrete point tests are useful in proficiency
batteries, since they give a point of reference and enable us to make
use of content validity. Global tests (i.e. cloze) are useful in
proficiency tests, since they can be validated by means of predictive
validity. Furthermore, for a given sample of testees it might be possible
to make use of content validity in selecting a series of texts for global
testing. Here we see the possible marriage in global tests of
criterion-referenced and norm-referenced testing. (Admittedly, if the sample
is "given" and homogeneous, we might be more honest to describe
this as an achievement test). Finally, I should like to see work develop
quickly in two areas. First, instruments (i.e. rating forms) should be
constructed for valid criteria: this would meet my point of getting
hold of meaningful figures. Second, in the area of validity I have not
mentioned the most powerful of all validities, construct validity,
most powerful because it derives from theory. Here is exactly where
communicative competence experimentation needs to be done.

Communicative competence is the primary ability to be tested. We 
should regard it as similar in development to language aptitude, and 
thus, when we come to construct tests of communicative competence, 
we use construct validity to justify our items. This means that we 
have to be clever, cleverer than in writing cloze items. But it also 
means that we do not perhaps need to wait, as I suggested earlier, on 
description before trying out our first experiments as long as we 
maintain the proper difference between the purpose of a test and the 
purpose of an experiment. 






NOTES 

1. I am grateful to Alan Moller and Dan Douglas who have helped me with some of the
ideas presented in this paper. Responsibility for the paper, however, is entirely
mine.

2. "Speeded reading" is a not entirely satisfactory cover term for both Tests 3 and 5.
Test 3 is a speeded test of reading. Test 5 is a test of speed reading. Hence the title
which attempts to bring both tests under one label.



REFERENCES 

Bormuth, J. R. 1969. "Factor Validity of Cloze Tests as Measures of Reading
Comprehension Ability." Reading Research Quarterly 4:3, 358-365.

Bowen, J. D. 1969. "A Tentative Measure of the Relative Control of English and Amharic
by Eleventh-Grade Ethiopian Students." UCLA Workpapers in TESL 2, 69-89.

Bruner, J. 1974. "Language as an Instrument of Thought." In A. Davies, ed., Problems in
Language and Learning. London: Heinemann.

Carroll, J. B. 1972. "Defining Comprehension: Some Speculations." In J. B. Carroll and
R. O. Freedle, eds., Language Comprehension and the Acquisition of Knowledge.
New York: Halsted Press.

Cronbach, L. J. 1964. Essentials of Psychological Testing. New York: Harper and Row.

Davies, A. 1967. "The English Proficiency of Overseas Students." British Journal of
Educational Psychology 37:2, 165-74.

Davies, A. 1968. "Introduction." In A. Davies, ed., Language Testing Symposium: A
Psycholinguistic Approach. London: Oxford University Press, 1-18.

Lyons, J. 1968. Introduction to Theoretical Linguistics. London: Cambridge University
Press.

Lyons, J. 1972. "Human Language." In R. A. Hinde, ed., Non-Verbal Communication.
London: Cambridge University Press, 49-85.

Oller, J. W., Jr. 1972a. "Scoring Methods and Difficulty Levels for Cloze Tests of
Proficiency in English as a Second Language." Modern Language Journal 56:3, 151-8.

Oller, J. W., Jr. 1972b. "Controversies in Linguistics and Language Teaching." UCLA
Workpapers in TESL 5, 39-50.

Oller, J. W., Jr., and C. A. Conrad. 1971. "The Cloze Technique and ESL Proficiency."
Language Learning 21:2, 183-95.

Rankin, E. F. 1957. "An Evaluation of Cloze Procedure as a Technique for Measuring
Reading Comprehension." Unpublished Ph.D. dissertation. Ann Arbor: University
of Michigan.

Schlesinger, I. M. 1968. Sentence Structure and the Reading Process. The Hague:
Mouton.

Weaver, W. W. and A. J. Kingston. 1963. "A Factor Analysis of the Cloze Procedure and
Other Measures of Reading and Language Ability." Journal of Communication 13:4,
252-61.

Widdowson, H. 1974. "Stylistics." In S. P. Corder and J. P. B. Allen, eds., Edinburgh
Course in Applied Linguistics, Volume 3: Techniques in Applied Linguistics. London:
Oxford University Press.

DISCUSSION: 

Oller: John Clark said yesterday in regard to cloze tests and related tasks that
this behavior would rarely be called for in normal language situations. I think
that's true only in the case of the kind of cloze tests that Dr. Davies used and
the kind described in the Bondaruk paper. The problem there is that the
blanks are spaced too close, and it becomes, I think, more of a puzzle solving
task. It gets progressively farther and farther away from the sort of thing that
people normally do in conversational use of language. You do, of course,
occasionally supply words when you're listening, and the same kinds of
things happen in reading when you run across an unfamiliar word, for
example. Research with native speakers has shown that spaces closer than every
fifth word generate a lot of items that even native speakers cannot answer.
So that one would expect it to change the properties of the task fairly
substantially with non-native speakers as well. And for that reason the .4 to .7
correlations that Dr. Davies has observed are not particularly surprising to
me, and in fact I wouldn't expect you to be able to get much better than, say,
a .7 correlation with that type of test unless it has some characteristics that
would controvert research that's been done with native speakers. I also have
a second comment concerning sampling techniques. It seems to me that the
idea that you can adequately sample language in the traditional sense of the
term, or the notion of sampling in statistics, is really inappropriate. I think,
rather, that what we ought to be doing is trying to challenge or test the
efficiency of some internalized grammar. There's an infinite number of possible
English sentences, and any test is an insignificant sampling of them.

Davies: In answer to the first comment, I never claimed that my type of cloze
test is intended to be a direct test of behavior in the sense that this is what
people actually have to do. But as I tried to argue this morning, I don't think
this is what most tests do in fact. There always has been a gap between what
is presented and what is expected, what it is meant to represent in some way.
The fact that I present items that may occasionally come more than 5 spaces
apart is, I think, unimportant, since the initial letter is given, which makes a
very remarkable difference, obviously cutting down chance by a
considerable amount. Also, I have quite carefully determined that native speakers in
fact score at least 95 percent correct on this test. I don't mean
approximations, but original words. I don't think that I would accept your criticism at
this point on those grounds as being a very strong one. As far as the
correlation is concerned, again .4 to .7 is, remember, a correlation of each of these
tests with some kind of predicted criteria which sometimes came at the end of
a whole year of study. So that's a long time to be predicting anything. My
point of quoting the range of correlations is to indicate that sometimes they
were better, sometimes they were worse. There were worse ones than those
also, but they were always insignificant. It is, after all, a battery of tests. If
one of my tests had been predicting at the level of .83, I would have
abandoned the other tests in the battery. But since it is a battery, the multiple
correlation that they adapt to does sometimes reach about .8. As to the other
question about sampling, I take your point about the infinity of possible
sentences in English. However, this does not prevent either linguists or teachers
from assuming that they are talking about the language in some way. If they
appreciated the vastness of infinite, then it seems to me they would both give
up because they would feel that they would never get anywhere. But people
don't do that. They assume that they are getting somewhere and that what
they're about is meaningful in terms of the language. What they're doing is
sampling, and the success of what they're doing is determined by the
appropriateness of the sample they take.

Oller: It seems to me that the basis for the kind of cloze test that you've done,
Dr. Davies, is kind of discrete point philosophy related to the notion of
sampling techniques. I think that's your basic argument against the integrative or
overall global proficiency type of test. But there is a fundamental problem
with the sampling theory that assumes that even though a task is horrendous
and very long, that one should go ahead and tackle it anyway with relatively
primitive tools. That is, if there is an infinite number of twenty-word
sentences in English, then to tackle that task by any kind of procedure that
assumes listing and sampling of items from a list is not only a primitive method,
but one that is essentially unworkable, I think. The alternative afforded by
integrative testing, or by what Spolsky speaks of as global proficiency testing,
is to assume that the learner is internalizing a grammar that itself possesses
properties which enable it to cope with an infinitude of sentences. Somebody
suggested this morning that people have competencies that involve samples
of language. I think that's really not true for native speakers, who are
capable of understanding just about any sort of English whether they've ever
heard it before or not. We don't really go about memorizing samples. These
two philosophies of testing, and of what language proficiency consists of,
are different in a very fundamental and deep sort of way. I think that it's
important to make that distinction.

Davies: It seems to me that your philosophy is that of a direct test. What you
really want is to get hold of some behavior that somehow is in direct
equivalence to what the learner is, as you put it, internalizing. It seems to me that
this is not necessary, though, as I pointed out at the end of my paper, it can
be of use. I don't think, however, one should assume that this is the only way
in which one can test.

Cartier: When you're faced, as I am, with the necessity of deciding whether
Sergeant Jones or Lieutenant Smith has some degree of skill in Russian or
Persian or whatever it may be, you have to figure out some system by which
you can report to the Air Force or to the Army something about that. In order
to do that, I have to make something that will be called a test. And regardless
of what I do, it's necessarily going to be a sample of his linguistic behavior.
Also, John Oller pointed out that native speakers have something more than
a sample of the language. If you've ever talked to a lawyer about a court
decision or a contract or something, you know for a fact that his sample of
English differs from yours. This is also true for pilots and for cab drivers. I
happen to believe that is a part of the English language, but it is a special
sample that that person has, and I suggest that each of us has, along with his own
idiolect, his own special sample of the language at large.

Spolsky: It seems to me that we're using the term sample in two senses and
for two different purposes. One, we're using it in the general sense, that is,
a sample is something selected out of something and obviously any test has to
be a selection out of the total universe. Second, we're using the term sample
as it's defined and used within certain kinds of statistics to justify the fact
that a sample represents the whole.

Oller: I think that one way of describing this other kind of sampling is that
you have to ask people to do something with language which gives you some
information about what kinds of language situations they are capable of
responding in. Instead of thinking about taking a bunch of items out of a
potential universe of items, you think in terms of the underlying grammar that is
appropriate to the situations that you're trying to put the learner into in order
to find out how well he handles the language of those situations. In other
words, how efficient is his grammar rather than how representative is your
sample of some list. We ought to be sampling or testing or measuring or
challenging the efficiency of the internalized grammar. If we start thinking in
those terms, we formulate, I think, a substantially different set of questions
than we do if we think in terms of trying to find a representative sample out
of a list or universe of discrete items of some sort.

Davies: Yes, but the trouble with your argument, as I found it not only in
regard to my paper but in previous discussion also, is that you seem to be
wanting to use your test not merely as a means of testing this internalized
grammar, but of finding out what it is. It seems to me, therefore, that you're
trying to do two jobs with the same thing, and I don't think that's satisfactory.

Oller: Suppose you do find out some fundamental things about the nature
of the grammar. I would say more power to you if you can. And if at the same
time you also do a better job of finding out what level of proficiency the
student is at, again, more power to you.

Davies: It seems to me that you're justifying, as I understand it, the selection
of a text for cloze procedure on the basis that any text is as good as any other.
It seems to me that, if pushed, you would not in fact agree that that is the case,
and that you would have to admit that you sample in some way. And then I
would ask you on what basis you sampled.

Oller: If we're talking about testing for instructional purposes, and most of
us are interested in it from that point of view as well as from the other
research angles, I think we can assume that the classroom teacher has a
minimal amount of intelligence about what level of language is appropriate to that
class of students. The person who is admitting foreign students at the
university level has a pretty good understanding of the kinds of language skills that
would be appropriate to those foreign students. One would construct a
listening comprehension test that was appropriate to the sorts of things that would
happen in university courses.





Problems of Syllabus, Curriculum and Testing 
in Connection with Modern Language 
Programmes for Adults in Europe 



Gerhard Nickel 



This paper reports mainly on work and research undertaken by the
Council of Europe in Strasbourg with a view to developing a unit/credit
system for modern language learning by adults. This system,
of course, urgently requires a set of tests, large in number, adapted
to the different aims and objectives of the learners.

Beside the so-called objective tests, some of the classical tests like
composition and translation are likely to be retained within this
system. Among other topics, therefore, I shall briefly discuss the possible
role of translation within the system. Finally, I should like to touch on
urgent problems connected with language testing which should be
tackled prior to devising tests, including the evaluation and grading
of errors made by learners of foreign languages. There appears to be
more interest in this problem in Europe than in other parts of the
world. As I stated in my paper at the 24th Annual Georgetown Round
Table, "The future of Europe requires the imperfect polyglot rather
than the perfectionist. . . ." (Nickel 1973b:183)

Integration and mobility of population within Europe must be
promoted through increased foreign language learning, particularly
among adults, since schools have, on the whole, already intensified
efforts to teach foreign languages. Intensifying language teaching is
closely linked with the strengthening of motivations. Motivation again
depends heavily upon breaking down a global concept of language
teaching into units and sub-units, combined with the close description
of language needs in Europe. These needs must form the basis for
devising courses, tests and examinations. Thus, a multi-dimensional
classification of learners' needs should provide a framework for the
content, type and standard of tests and examinations, already existing
tests as well as new ones. In order to establish potential equivalences
within Europe and between Europe and other continents, the tests
and examinations should be monitored and constantly evaluated.

Before devising these teaching and testing units, one must
undertake a classification of situations in which the languages are to be
used by the learners. This classification may utilize different
parameters. A number of classifications have been set up in the United
States by different agencies, the most commonly known being the
U.S. Civil Service Definitions for Language Proficiency. Without
attempting to hierarchize the system of parameters, I think the
following are worth mentioning: situation, number, role, time, place
(including "macro-place," e.g. countries). A further attempt to describe
learners' needs in the light of situations in which the foreign
languages will be used has been based on socio-professional data. The
latter very often combine with socio-cultural data. Thus, for instance,
a scientist will certainly be expected to know enough scientific items
in the target language (TL) to be able to understand, speak, read and
perhaps even write in his special field. Sometimes, however, he will
only need a passive knowledge of the language. On the other hand,
his socio-cultural ambition may motivate him to acquire a larger or
smaller part of the general vocabulary of the TL so that he will be
able to discuss a variety of subjects outside his field.

An example of language achievements which can be aimed at within 
a specific vocational framework, that of a qualified business secre- 
tar\. has been presented by Trim (1973:27). His list is an open-ended 
one which will certainK var\ from firm to firm and from country to 
country, and it ma\ be affected by other factors. It shows clearly how 
man\ situational and linguistic tasks a secretary is confronted with. 

If we consider the man> kinds of situations in which languages may 
be used and the various kinds of socio-profossional and socio-cultural 
motivations which may be operative, it becomes quite obvious that 
different linguistic levels exist and that the finding of a so-called 
"threshold level" will be quite a difficult matter (van Ek 1973:95). 

During the past few \ears the German Confederation of Adult Edu- 
cation Colleges fVo/kshochschu/verband) hric developed basic pro- 
grammes in seveial langauges with muiimal vocabularies for English. 
French. German.. Russian and Spanish ranging from about 2000 to 
2300 items each, with the aim of establishing minimal language pro- 
grammes for these languages. It has since become obvious, however, 
that programmes containing fewer items and e\en simpler structures 
shoul(i aiso be made available for certain learners (tourists, migrant 
workers who have just entered the country, etc.). 

What is becoming increasingly frequent in Europe, particularly at
European committee meetings, is the use of two or even more
languages in discussions. Speakers use their mother tongues or other
languages which they assume their hearers will be able to
understand, and are prepared to be addressed in a different language by
other speakers using their mother tongues or idioms with which they
are familiar. This, of course, creates a new situation, which may well
have to be taken into account one day in devising language
programmes including language tests of different bilingual types.

Where language tests are concerned, two general courses are open
to us: tests can be devised for given language courses or language
material, or courses can be designed for given tests. The former
approach is the more popular one in Europe at present. The latter is not
infrequently used by some governmental agencies in the U.S. This
means that a considerable number of tests of different types will have
to be designed which take into account all the various parameters
mentioned above. There is little doubt that in addition to short
time-saving tests of the so-called objective type (I prefer the term "tests
with reduced subjectivity") such as those of the multiple-choice kind,
the old classical type of tests involving composition and translation
will be retained in spite of all the criticism in the past.

It is very interesting to look at the history of the role of translations
in language teaching and testing. While translations originally formed
part of the so-called "translation method" (whatever that meant),
their pedagogical value outside the field of translation was sometimes
questioned on the grounds that translating is a "highly complex skill
[requiring] special talent and special training" (Lado 1967:261;
Valette 1967:162). Translating is undoubtedly an art, but so is writing
essays and compositions. The choice of simple and more "concrete"
texts can reduce the expertise required and therefore simplify the
procedure. Just because translating is such a complex activity which
encompasses several skills, it has, I think, greater value from a
diagnostic point of view than some other tests. All the so-called four skills
are complex. Therefore, I am sure that future test systems in Europe
will also include passages of translation. Swedish universities, for
instance, which have abolished translations in their state final
examinations have now discovered that a very important parameter for
testing higher skills has been thrown overboard, to the regret of both
teachers and many students. Translations may, therefore, be
reintroduced into the Swedish university system.

Supervision-investigations into the achievements of German
students at universities where I have taught have indicated that on the
whole there was a closer correlation between the marks given for
written translations and oral performance in the TL than generally
has been assumed. As we all know, we have not yet investigated fully
enough the connections that may exist between passive and active
skills and particularly between non-speaking skills and the skill of
expressing one's self orally. Additionally we do not know enough yet
about the correlations between the different skills, and we have
certainly been overrating the disparate nature of the different skills. It
also became clear to me that translating constitutes a clearer
distinguishing parameter among higher marks than do other tests.
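
The kind of correlation referred to above can be illustrated with a short computation. The following is only a minimal sketch, assuming that each student has one mark for written translation and one for oral performance; the function name `pearson_r` and the marks themselves are invented for illustration and are not data from the investigations mentioned.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists of marks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least two marks")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one set of marks is constant")
    return cov / sqrt(var_x * var_y)


# Invented marks for six hypothetical students (assumption, not source data).
translation_marks = [2, 3, 3, 4, 5, 5]
oral_marks = [1, 3, 2, 4, 4, 5]

r = pearson_r(translation_marks, oral_marks)
print(f"correlation between translation and oral marks: r = {r:.2f}")
```

A coefficient near 1 would indicate that the two sets of marks rank students in much the same way; it would not by itself show that the two tests measure the same skill.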








There are many ways of defending and justifying a given type of
test if the text is simple enough and does not make sufficiently high
demands on the testees' non-linguistic skills. One should, where
possible, make use of realistic, down-to-earth texts such as those in
middle-class journals and newspapers.

Another argument in favour of translations is that they often have
to be made in a given cultural context, at least where the European
cultural context is concerned. As examples one could start with the
wording on road signs, public warnings and the like, and move on to
the situation where short passages from newspaper articles,
advertisements and other brief items have to be translated for acquaintances,
friends, tourists, etc. Here translations correspond to real-life
situations. This criterion, of course, does not offer anything new and
considers the value of translations within their own scope.
Reliability, validity and objectivity of these translation tests are necessarily
increased if the text is relatively simple. Needless to say, there is an
increasing demand for good translators and interpreters in Europe,
but the specialized techniques they have to learn are taught at special
schools.

Other factors have contributed to the rise in popularity of
translation texts in the recent past. A consequence of foreign-language
instruction, geared to individual needs, of contrastive linguistics, and
of modern learning psychology, with its growing emphasis on the
cognitive aspects of learning, is that applied linguists no longer
insist on a monolingual approach to foreign language learning, and
the limited direct confrontation with mother tongue elements is no
longer considered harmful (Butzkamm 1973; Altmann 1972; Levin 1973;
Politzer 1968; Beck 1974). The use of mother tongue elements, which
includes the presentation of TL rules, does not seem to interfere with
the acquisition of the TL. Thus, a very important argument against
using the mother tongue as a metalanguage in FL teaching seems to me
to have lost some of its weight. This certainly does not mean that
intensive use of the mother tongue within FL teaching is to be
advocated.

What is more, from a contrastive point of view a confrontation with
the mother tongue may reinforce one's knowledge of rules of the TL.
One of the tasks of language testing is certainly to test the amount of
interference taking place between the mother tongue and the TL. Not
all types of learners will make more interference mistakes when
translating than when writing an essay or a composition, but
undoubtedly some groups of learners will. However, there is a
correlation between the number of interference mistakes and the kind of test
depending upon the type of learner (Nickel 1971:225). Multiple-choice
tests, for instance, seem to elicit fewer interference mistakes than
tests assessing the active use of language. Psycholinguistic aspects like
stress, nervousness and depression and sociolinguistic features like
inhibition, which very often shows itself also in the use of the mother
tongue (lack of motivation, etc.), are some of the factors distinguishing
individual forms of behaviour, as is shown in the amount and type of
interference between the mother tongue and the TL. If one of the
tasks of the test is to measure interference, then translation should
not be dismissed from the group of tests as a whole.

As has been pointed out by other scholars, the translation of
mother tongue sentences into the TL has as a testing technique certain
advantages over purely TL tests involving completion and
transformation of utterances. Thus, an example like my/pyjama/be/red
"[may] tempt some pupil to make errors that could scarcely be made
elsewhere: *My pyjama is being red, *My pyjama are being red, *My
pyjamas are being red ... are in my experience all possible answers
to this kind of test. Yet I doubt if any of them would occur if the
printed word be had not occurred in the formulation of the test. In
addition, it may be said that the formulation of a TL question usually
involves highly unnatural and often semantically strained language"
(Mathews-Bresky 1972:59).

Here we encounter a clear methodological disadvantage of
monolingual tests. It is also apparent that certain tests produce and elicit
certain kinds of errors, and we must make sure whether a particular
kind of error is due to a particular kind of test or whether it is really
an "all-round" error on the competence level.

Thus, I have tried to show in this paper that there are several
reasons to support translation as one type of test. These reasons, by
implication, range from the psychological via realistic (translation for
its own sake) to the contrastive and even to the sociological type.

One problem that in my view has not received sufficient attention
anywhere in the world, and which certainly will have to be looked
into closely in connection with the previously mentioned European
scheme, is the question of error evaluation and error grading. It is
undoubtedly in this area that more objectifying is called for. On the
whole I think we have been talking more about tests than about the
grading of errors. I believe, however, that a discussion of how to
grade errors should precede the devising of tests or at least be carried
on parallel to test designing. Error grading on its part will be closely
linked up with learners' objectives, needs and motivations, to name
only some factors involved. This problem certainly cannot be solved
by linguists alone, since errors also have pedagogical implications.
We should endeavour, however, to set up a hierarchy of the factors
relevant to this issue. The following four parameters have been
suggested among others: (1) degree of acceptability; (2) degree to which
the communication act is distorted by the error; (3) significance in
the process of teaching and learning; (4) degree of difficulty
encountered by learners. It has also been suggested that acceptability
should be given priority over the factor of communication distortion
(Legenhausen 1974).

In the light of what I said previously, this kind of hierarchization
cannot be considered an absolute one. Undoubtedly, with lower
"thresholds" communicability should be given priority over
acceptability. Native speakers all over Europe will have to accustom
themselves to applying less rigorous standards of linguistic correctness
when confronted with various kinds of non-acceptable but clearly
decodable statements (Nickel 1972). The hierarchization I have just
mentioned may, of course, be used at a college or university level,
where higher degrees of proficiency are called for. It should by now
be clear that a very important point in connection with language
testing will be the problem of setting up norms and standards of
correctness. This is a very significant point which is integral in establishing
the degree of validity of tests. Even ungrammatical and unidiomatic
forms may be deemed correct at certain testing levels and completely
incorrect at other levels. This is particularly true with low threshold
levels in connection with basic communication at a very simple level.
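
The level-dependent priority described above can be made concrete with a small illustration. This is a sketch under loudly stated assumptions, not a grading scheme proposed in the paper: the two ratings per error, the level names `threshold` and `university`, and all numerical weights are hypothetical and chosen only to show how the same error could be penalized differently at different levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRating:
    """Hypothetical ratings of a single error, each between 0.0 and 1.0."""
    unacceptability: float  # 0.0 = fully acceptable to native speakers
    distortion: float       # 0.0 = the message is still fully decodable


# Invented weights (assumption): communicability dominates at a low threshold
# level, acceptability dominates at college or university level.
WEIGHTS = {
    "threshold": {"unacceptability": 0.2, "distortion": 0.8},
    "university": {"unacceptability": 0.7, "distortion": 0.3},
}


def error_penalty(rating, level):
    """Weighted penalty for one error at the given (hypothetical) level."""
    weights = WEIGHTS[level]
    return (weights["unacceptability"] * rating.unacceptability
            + weights["distortion"] * rating.distortion)


# A non-acceptable but clearly decodable statement.
decodable_error = ErrorRating(unacceptability=0.9, distortion=0.1)

for level in ("threshold", "university"):
    print(level, round(error_penalty(decodable_error, level), 2))
```

With these invented weights the same error counts for little at the threshold level and for considerably more at university level, which is the kind of non-absolute hierarchy argued for above.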

It is also quite clear that error marking should never be done by
non-native speakers working by themselves but only in cooperation
with native speakers, since there are enormous divergences between
native speakers' and non-native speakers' judgments concerning
errors. Naturally there are divergences also among native speakers
and among non-native speakers due to attitudes towards language
uses and language norms acquired in connection with their mother
tongue learning. Native speakers, for instance, who have stayed away
from their home-countries for a long period have very often
antiquated views on present-day usage. These views, by the way, are also
often reflected in teaching materials produced by them. If all these
factors are overlooked, tests of all types, and not only translation
tests, become quite unreliable and of little value.

The fact that a given error may be due to interference between
target languages or between one TL and the mother tongue should not
cause us to draw the false conclusion that this is necessarily a serious
error. Other factors have to be taken into consideration, too. Thus,
error evaluation and grading is of great importance for the assessment
of tests (Nickel 1973:9).

I am convinced that the highly complex certification system for
language teaching, particularly where learners are adults, will involve
an equally complex set of tests incorporating different views on error
evaluation. In spite of the complexity of the nature of errors, where
linguistic, sociological, communicative, pedagogical and other aspects
are involved, one should attempt to increase the objectivity of
language testing by trying to describe errors and their significance in an
objective manner at different stages of learning with various types of
tests and in communication situations.

In connection with this certification system, all kinds of tests will
have to be considered. Some of the classical and less objective types
like translation will have to be re-considered from the point of view
of psychology, contrastive linguistics, real-life situations,
sociolinguistics and other factors. Looking at the matter from the point of
view of psychology, I am convinced that quite a few speakers of
foreign languages formulate their TL utterances via silent translations,
though perhaps they do this subconsciously. This may also be true
with other tests like cloze testing, where some learners may also use
silent translations for help.

REFERENCES 

Butzkamm, W. 1973. Aufgeklärte Einsprachigkeit. Heidelberg: Quelle and Meyer.

Lado, R. 1967. Language Testing. Reprint. London: Longman's.

Legenhausen, L. 1974. Fehleranalyse und Fehlerbewertung. Unpublished Ph.D.
dissertation.

Levin, L. 1973. "Comparative Studies in Foreign-Language Teaching." The
GUME-Project. Stockholm: Almquist and Wiksell.

Mathews-Bresky, R. J. H. 1972. "Translation as a Testing Device." English Language
Teaching 27:1, 59-65.

Nickel, G. 1971. "Problems of Learners' Difficulties in Foreign Language Acquisition."
International Review of Applied Linguistics 9:3, 219-228.

Nickel, G. 1972. Fehlerkunde. Beiträge zur Fehleranalyse, Fehlerbewertung und
Fehlertherapie. Berlin: Cornelsen-Velhagen und Klasing.

Nickel, G. 1973a. Testen. Probleme der objektiven Leistungsmessung im
fremdsprachlichen Unterricht. Berlin: Cornelsen-Velhagen und Klasing.

Nickel, G. 1973b. "Needs and Objectives of Foreign Language Teaching in Europe." In
Kurt R. Jankowsky, ed., Georgetown University Round Table on Languages and
Linguistics 1974. Washington, D.C.: Georgetown University Press. 179-186.

Sepp, B. 1974. Die Rolle der Übersetzung im Fremdsprachenunterricht. Unpublished
M.A. thesis.

Svartvik, J., ed. 1973. Errata. Papers in Error Analysis. Lund: CWK Gleerup.

Trim, J. L. M. 1973. "Draft Outline of a European Unit/Credit System for Modern
Language Learning by Adults." In Systems Development in Adult Language Learning.
A European unit/credit system for modern language learning by adults. Strasbourg:
Council of Europe. 13-28.

Valette, R. M. 1967. Modern Language Testing. New York: Harcourt, Brace and World.

van Ek, J. A. 1973. "The 'Threshold Level' in a Unit/Credit System." In Systems
Development in Adult Language Learning. A European unit/credit system for modern
language learning by adults. 91-128.

DISCUSSION 

Lado: I think translation as a test of translation has a certain amount of face
validity. It's the business you're engaged in. Therefore, when you use
translation to test translation, the validity questions are, is it a good sample of
translation, is it an appropriate sample, and so on. However, when you use
translation as a test of speaking, then you have a problem of face validity
because translation is different from speaking. Therefore, the burden of
validation is to find out whether this test of translation correlates highly with
a valid test of speaking. If it does, then you can use it. If it doesn't, then you
can't. I would like also to contribute some of my research in a series of
experiments on language and thought. In one of the experiments I had subjects
do immediate translation, and an equivalent group of subjects did, if you
want to call it, delayed translation. They took some time in between. The
number of errors of various types of those who did immediate translation
was 3 to 1 higher than those who were asked to retranslate with a time delay
between the two. And this was in both directions, going from L1 to L2 or going
from L2 to L1. When you're forced to do immediate translation, your
immediate memory is in full operation because you can retain a phrase in immediate
memory and you tend to go from surface structure in L1 to surface structure
in L2 or vice versa. This increases the complication.

Nickel: I don't think we disagree basically. First of all, I've also noted the
difference between immediate and delayed translation. Secondly, in
terms of validity we have the same problem in connection with any other
kind of testing, and I wonder whether the correlation between
multiple-choice tests and other tests is much higher than between these tests and
translations. If my assumption is correct, more speaking is done via some kind of
subconscious underlying and silent translation, and translation is underlying
lots of testing performance. I'm really pretty well convinced that we have
been underrating the amount of translation that is being done in practice.
This is, of course, an assumption that we have to prove, and I don't want to
give you the impression that I'm in favor of a mass translation test, but rather
a battery of tests with one of the exams a translation. I'm also in favor of
some kind of guided translation, where certain rules and hints are given.
Oller: I agree with the idea about translation, and just wanted to call
attention again to important research which showed that the kinds of errors people
make in translating from the native language into the TL are precisely
analogous to the kinds of errors they make in spontaneous speech in the language,
and also in imitating fairly long sequences of information in the TL.
Rashbaum: I'd like to ask Dr. Nickel what specific criteria were used in that
translation test, and what were the weights attached to them, ranging from
grammar, lexicon, and the fluency of the style of translation?
Nickel: We have not yet set up concrete tests which have official
acknowledgement. We envision tests where lexis will be given priority over grammar,
for instance for migrant workers staying only a short time in European
countries. At higher levels we will give acceptability priority over
communicability, but there will be other groups of learners where we will have priority
on communicability, including a certain weighting on vocabulary.



Concluding Statement 

Bernard Spolsky 



One of the best ways to try to sum up this Symposium is to consider
the four questions that Randall Jones set for us at the start of the
Symposium and examine how well we have answered them.

The first question that he suggested we consider was the state of the
art of language testing in the United States Government. He made it
clear how difficult it would be to make any changes, but urged us to
propose any improvements that we considered worthwhile. I think
we touched on some important aspects of this question. First, there
was our intensive discussion of the Foreign Service Institute's oral
interview. Most language testers have a deep respect for this test but
are usually frustrated by the lack of published description and
discussion of it. I know of only one article dealing with it at any length,
and even that is quite brief. Presumably, there is a good deal of
internal documentation, and there has surely been a great deal of
in-house discussion. But I know of no opportunity before this meeting
for academic testers to discuss it in public with the Government
testers who are working with the technique. It was very helpful,
therefore, to hear Claudia Wilds' paper and to listen to the discussion
that followed. The doubts raised about the validity of the scale and
about the sociolinguistic limitations of the formal interview, with the
consequent questioning of its predictive validity for other situations,
are healthy and useful. We reached no conclusions, but the discussion
that started was a fruitful one. It was particularly encouraging to find
that, with all the great investment of time and effort that has gone into
the oral interview, there was no suggestion that it was not still open to
debate and improvement.

A similar openness was obvious in the other statements by members
of the Government testing community. When the testers from the
Defense Language Institute and the Department of Defense proposed
new techniques or expressed concerns about validity, they did so
with a degree of scholarly tentativeness that any academic researcher
would be proud of. That is to say, there were no signs that
Government language testers were set and smug about their existing programs
or certain that any new ones they were working on would be perfect
solutions to their problems. They showed both concern for what
works and interest in fundamental principles. In this situation, I think
we can answer Dr. Jones' question by saying that language testing in
the United States Government is alive, inquisitive, and healthy.

His second question was whether there were common problems for
various members of the testing profession. We did very well with
problems, even if we could not agree on any solutions. Here, I think
a number of useful and interesting contrasting trends became clear
in the course of the meeting. There was the usual contrast between
those who want to know what they are going to do tomorrow to test
3,000 employees about to be sent overseas, and those of us who ask
what a language test really is. But we generally did well in balancing
the theoretical and the practical. A second, more theoretical,
argument ran through the meeting: the old question of discrete point
versus integrative tests. With most of the big guns now on their side,
the integrators have not yet squelched some discrete practitioners. A
third interesting struggle was kept beneath the surface most of the
time: the sometimes competing claims of psychologists and linguists,
each with their own conception of what language testing should be.
Of all fields in which testing is used, language testing is, I believe, the
one where the subject matter specialist does best. It may have been a
result of the proportion of linguists to psychologists at this meeting,
but I think that it is a fair reflection of what is happening in the field.
Linguists are, in fact, easily interested in such testing questions as the
distinction between proficiency and achievement (related to
competence and performance), the sociolinguistic questions of how a direct
test differs from an indirect one, and the common problem behind all
this, that of validity: how do you know what areas of linguistic or
communicative competence you are measuring and which do you
want to measure anyway? That is to say, our practical problems tend
to be common to all language testers; our theoretical ones tend to
unite language testers with linguists.

The third question that Randall Jones raised was whether there are
any new ideas or techniques in testing. New is, of course, a relative
term. There were a lot of things said or suggested here that might have
seemed surprising twenty years ago, but few that would not have
fitted into the 1967 Michigan meeting (Upshur 1968). Some of the
questions raised there were discussed here: the basic question of what
does it mean to know a language and the question of whether one
tests for knowledge of items or rules. Another question raised at
Michigan, and under-represented here because some of the scholars
invited weren't able to come, was the sociolinguistic aspect of
testing. What does it mean to test communicative competence? What is
the influence of the testing situation itself? We didn't raise many new
questions, nor did we propose new techniques. Indeed, some ideas
with new names turned out to be old techniques in different contexts.
There are some new techniques we didn't hear much about, such as
those involving communication tasks, but we can guess that if they
had been described, they would have turned out to be something
tried in British West Africa in 1897 and never before referred to in
print!

The fourth question we were set was to suggest directions for future
research and development. There are two kinds of research strategy
that we seem to agree are most necessary. The first is to test a small
group of subjects with a great variety of techniques, so that we can
find some way of deciding the relationship between the various
kinds of measures that are used. The second is to try certain
techniques on a great variety of subjects, so that we can consider the
interaction of subject and technique, looking, for example, at the relation
between the subject's learning history and the test technique.

There is another area in need of attention that I should mention
here, that of terminology and definitions. The fact that most of us
come from language teaching or linguistic backgrounds means that
we do not have to accept standards of terms or tests such as those set
up by the American Psychological Association. And as linguists, we
assume that we either take part in the writing of dictionaries or that
dictionary makers record our usages. But we should still be careful of
the way we use words. Very often, we were talking about the same
thing but using different terms, as when Clark talked of face validity
and Davies of content validity, or when Oller talked of cloze tests and
Bondaruk and his colleagues of contextual tests. And quite often, we
were probably referring to different things when we used the same
terms. The more we have meetings and discussions like this, the more
chance we have to understand each other's special terminology, or to
come to agree on standardization.

There are three areas in which research is clearly going to be
important. The first might be labeled the psycholinguistic area, where
the concern is to understand what it means to know a language. The
basic question of the distinction between discrete point and
integrative tests might be considered here.

The second is the psychometric and statistical. Language tests in
general and global or integrative language tests in particular raise
some very interesting statistical problems. Knowing a language
appears to be different from knowing a lot of other things. Linguists
believe this, and will keep on saying it to each other whether anybody
else will listen to us or not. But if we are right, then testing knowledge
of language and testing knowledge of other things should turn out to
be different in crucial ways. Most of the statistical and psychometric
techniques used within the various fields of testing seem to assume
that you have to handle lots of discrete items and find ways of pulling
them together. There are, however, good reasons to believe that, in
the case of language, one is dealing with sizable or complete chunks:
rather than having to pull them together, one assumes that the
togetherness is there but needs to be explored. It was particularly
interesting, then, to find a statistician pointing out the existence of
problems like these and suggesting that new kinds of statistics will be
needed to handle the special problems of language tests.

The third area requiring a great deal more research than any of us
talked about is the sociolinguistic aspect. A first question here is what
a direct measure really can be. The criticism of the FSI oral interview
was not that it fails to measure how well people perform in a formal
interview, but to what extent a formal conversational interview might
predict other kinds of real language behavior. We need, therefore, the
kinds of definitions that sociolinguists are giving us of various aspects
of communicative competence, and we need to know how one might
go about sampling from the various situations. There are already
quite a number of clues to the answer to that question in work on
bilingualism, where the concept of domain turns out to be a very
useful construct for lumping together large areas of different
situation, role, and style. It may well be that when one tests for real-life
situations, Fishman's work with domains might turn out to be a way
around the enormous task that Gerhard Nickel suggested when he
talked of listing all the possible linguistic situations in which a person
needs to perform. We will also need to face up to the problems of
style and register testing that are involved.

I think these are more or less the answers we gave to the questions
that Randall Jones set for us; whether or not he will be satisfied that
we have answered them is another matter. As an extra question of
my own, it would be reasonable to ask, "How might such research be
done?" I am reminded of the distinction sometimes made between
applied and basic research. It goes something like this: basic research
is what I want to do; applied research is what they will give you
money to do. I believe we have seen at this Symposium a display of a
field in which there is a very useful connection between the
practitioner who can get money to do something and the theoretician who
is asking basic questions. This tie between practice and theory,
whether money is involved or not, is, I believe, why linguists find
language testing such an intriguing field. The theoretical questions, the
basic questions that need to be solved if useful tests are to be
produced, are very similar to the theoretical questions that need to be
solved to understand language. That is to say, developing a good
measure of language competence is very close to understanding what
language is, or, put another way, the problems of language testing
turn out to be very serious challenges to our understanding of what
language is. There is, therefore, an extremely useful relationship set
up at a meeting like this between those of us from universities who
tend to worry about basic issues and those in the practical world who
need to produce workable tests. It is very useful for the two groups
to come together, to notice common problems, and to notice that
ultimately the solution to the practical and the theoretical problems
will come at the same time, whenever that may be. There is not the
very strong division between theory and application that we
sometimes seem to feel when we first start talking to each other.

I would like to take this opportunity, then, to thank the sponsors of
this Symposium, to thank the Government agencies for inviting the
Commission on Language Testing to join with them in setting up this
meeting, to thank Georgetown University and the Center for Applied
Linguistics for their work in making the meeting possible, and to
thank the audience who patiently listened to our discussions and
raised new questions for us to consider.






List of Contributors 



John Bondaruk is a Research Psychologist with the Department of
Defense, where he is currently Chief of Personnel Research and
Testing.

Francis A. Cartier is with the Office of Research and Development of
the Defense Language Institute, which is responsible for development
of courses and tests in the over 40 languages taught to personnel of
the U.S. Armed Forces.

James R. Child is a designer of language aptitude and proficiency
tests for the Department of Defense and works on a contract basis as
a translator of Turkish, Hebrew and Indonesian for the World Bank.

John L. D. Clark is Senior Examiner in Foreign Languages at
Educational Testing Service, Princeton, New Jersey, where his principal
responsibilities include coordination of test development for the
Test of English as a Foreign Language (TOEFL).

Alan Davies is on the faculty of the University of Edinburgh. His 
principal interest is in language testing. 

Harry L. Gradman is an Assistant Professor of Education at Indiana 
University. He is currently on leave from Indiana and is Visiting 
Assistant Professor of Elementary Education and Adviser to the Pro- 
gram in Linguistics and Language Pedagogy at the University of New 
Mexico as well as Co-director of the Navajo Reading Study.

Peter J. M. Groot is presently at the Instituut voor Toegepaste
Taalkunde, Rijksuniversiteit te Utrecht, The Netherlands, and is
Co-chairman of the AILA Commission on Language Tests and Testing.

Randall L. Jones is Assistant Professor of Linguistics and German at
Cornell University, Ithaca, New York. His main interests are in
language testing and the teaching of reading in a second language.

Gerhard Nickel is Director of the Institut für Linguistik: Anglistik,
Universität Stuttgart, and serves as an advisor to the Council of Europe
and UNESCO and as Secretary General of the Association
Internationale de Linguistique Appliquée (AILA).






John W. Oller, Jr. is the Chairman of the Department of Linguistics
at the University of New Mexico. He is particularly interested in the
integrative and pragmatic procedures of language testing and in the
area of psycholinguistics in general.

Calvin R. Petersen is with the Office of Research and Development
of the Defense Language Institute. He is a specialist in the fields of
clinical and research psychology.

Bernard Spolsky is Professor of Linguistics, Elementary Education
and Anthropology at the University of New Mexico as well as
Co-chairman of the AILA Commission on Language Tests and Testing.

Virginia Streiff is currently a Ph.D. candidate at Ohio State
University. Her major interest is in the general area of educational linguistics.

Emery W. Tetrault is a Language Researcher for the Department of
Defense. He is an applied linguist whose main interest is in foreign
language testing.

Claudia P. Wilds was formerly Associate Director of the
Psycholinguistics Program at the Center for Applied Linguistics. She has long
had contacts with the Foreign Service Institute (FSI) and currently
works at intervals on projects at FSI which are related to language
proficiency.






